Chapter 8. Replacing DistributedComputeHCI nodes

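During hardware maintenance, you might need to scale down, scale up, or replace a DistributedComputeHCI node at an edge site. To replace a DistributedComputeHCI node, remove services from the node that you want to replace, scale the number of nodes down, and then follow the procedures for scaling those nodes back up.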

8.1. Removing Red Hat Ceph Storage services

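Before you remove an HCI (hyperconverged) node from a cluster, you must remove the Red Hat Ceph Storage services. To remove the Red Hat Ceph Storage services, you must disable and remove the ceph-osd service from the cluster services on the node that you are removing, and then stop and disable the mon, mgr, and osd services.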

Procedure

  1. On the undercloud, use SSH to connect to the DistributedComputeHCI node that you want to remove:

    $ ssh tripleo-admin@<dcn-computehci-node>
  2. Start a cephadm shell. Use the configuration file and keyring file for the host that you are removing:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
  3. Record the OSDs (object storage devices) that are associated with the DistributedComputeHCI node that you are removing, for reference in a later step:

    [ceph: root@dcn2-computehci2-1 ~]# ceph osd tree -c /etc/ceph/dcn2.conf
    …
    -3       0.24399     host dcn2-computehci2-1
     1   hdd 0.04880         osd.1                           up  1.00000 1.00000
     7   hdd 0.04880         osd.7                           up  1.00000 1.00000
    11   hdd 0.04880         osd.11                          up  1.00000 1.00000
    15   hdd 0.04880         osd.15                          up  1.00000 1.00000
    18   hdd 0.04880         osd.18                          up  1.00000 1.00000
    …
  4. Use SSH to connect to another node in the same cluster, and remove the monitor of the node that you are removing from the cluster:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
    
    [ceph: root@dcn2-computehci2-0]# ceph mon remove dcn2-computehci2-1 -c /etc/ceph/dcn2.conf
    removing mon.dcn2-computehci2-1 at [v2:172.23.3.153:3300/0,v1:172.23.3.153:6789/0], there will be 2 monitors
  5. Use SSH to log in again to the node that you are removing from the cluster.
  6. Stop and disable the mgr service:

    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
    ceph-crash@dcn2-computehci2-1.service    loaded active     running       Ceph crash dump collector
    ceph-mgr@dcn2-computehci2-1.service      loaded active     running       Ceph Manager
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl stop ceph-mgr@dcn2-computehci2-1
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl --type=service | grep ceph
    ceph-crash@dcn2-computehci2-1.service  loaded active running Ceph crash dump collector
    
    [tripleo-admin@dcn2-computehci2-1 ~]$ sudo systemctl disable ceph-mgr@dcn2-computehci2-1
    Removed /etc/systemd/system/multi-user.target.wants/ceph-mgr@dcn2-computehci2-1.service.
  7. Start a cephadm shell:

    $ sudo cephadm shell --config /etc/ceph/dcn2.conf \
    --keyring /etc/ceph/dcn2.client.admin.keyring
  8. Verify that the mgr service for the node is removed from the cluster:

    [ceph: root@dcn2-computehci2-1 ~]# ceph -s
    
    cluster:
        id:     b9b53581-d590-41ac-8463-2f50aa985001
        health: HEALTH_WARN
                3 pools have too many placement groups
                mons are allowing insecure global_id reclaim
    
      services:
        mon: 2 daemons, quorum dcn2-computehci2-2,dcn2-computehci2-0 (age 2h)
        mgr: dcn2-computehci2-2(active, since 20h), standbys: dcn2-computehci2-0
        osd: 15 osds: 15 up (since 3h), 15 in (since 3h)
    
      data:
        pools:   3 pools, 384 pgs
        objects: 32 objects, 88 MiB
        usage:   16 GiB used, 734 GiB / 750 GiB avail
        pgs:     384 active+clean

    After the mgr service is removed successfully, the node that you removed it from is no longer listed in the mgr line.
  9. Export the Red Hat Ceph Storage specification:

    [ceph: root@dcn2-computehci2-1 ~]# ceph orch ls --export > spec.yml
  10. Edit the specification in the spec.yml file:

    • Remove all instances of the host <dcn-computehci-node> from spec.yml
    • Remove all instances of the <dcn-computehci-node> entry from the following:

      • service_type: osd
      • service_type: mon
      • service_type: host
  11. Reapply the Red Hat Ceph Storage specification:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch apply -i spec.yml
  12. Remove the OSDs that you identified with ceph osd tree:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm --zap 1 7 11 15 18
    Scheduled OSD(s) for removal
  13. Verify the status of the OSDs that are being removed. Do not continue until the following command returns no output:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch osd rm status
    OSD_ID  HOST                    STATE     PG_COUNT  REPLACE  FORCE  DRAIN_STARTED_AT
    1       dcn2-computehci2-1      draining  27        False    False  2021-04-23 21:35:51.215361
    7       dcn2-computehci2-1      draining  8         False    False  2021-04-23 21:35:49.111500
    11      dcn2-computehci2-1      draining  14        False    False  2021-04-23 21:35:50.243762
  14. Verify that no daemons remain on the host that you are removing:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch ps dcn2-computehci2-1

    If daemons are still present, you can remove them with the following command:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch host drain dcn2-computehci2-1
  15. Remove the <dcn-computehci-node> host from the Red Hat Ceph Storage cluster:

    [ceph: root@dcn2-computehci2-1 /]# ceph orch host rm dcn2-computehci2-1
    Removed host 'dcn2-computehci2-1'
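
The OSD IDs that you record in step 3 are the same IDs that you pass to ceph orch osd rm in step 12. As a minimal sketch, and not part of the documented procedure, the following shell function collects those IDs from ceph osd tree output for a single host. The function name osd_ids_for_host is hypothetical. It assumes the plain-text layout shown in step 3, where a host line with a negative bucket ID is followed by the OSD lines that belong to that host, and it assumes that awk is available where you run it.

```shell
#!/bin/sh
# Hypothetical helper (not a Ceph or Red Hat command): read plain-text
# `ceph osd tree` output on standard input and print, on one line, the
# OSD IDs listed under the host named in the first argument.
osd_ids_for_host() {
    awk -v host="$1" '
        # CRUSH buckets (root, host, ...) have negative IDs. A bucket line
        # such as "-3 0.24399 host <name>" starts or ends a host block.
        $1 ~ /^-[0-9]+$/ { in_host = ($3 == "host" && $4 == host); next }

        # OSD lines have a non-negative ID in the first column.
        in_host && $1 ~ /^[0-9]+$/ { ids = ids (ids == "" ? "" : " ") $1 }

        # Print nothing at all if the host was not found.
        END { if (ids != "") print ids }
    '
}
```

For the output shown in step 3, running ceph osd tree -c /etc/ceph/dcn2.conf | osd_ids_for_host dcn2-computehci2-1 would print 1 7 11 15 18. Compare the result with the tree output yourself before you pass the IDs to ceph orch osd rm, because the sketch does not validate the cluster state.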