13.5. 关于 ClusterGroupUpgrade CR
Topology Aware Lifecycle Manager(TALM)为一组集群从 ClusterGroupUpgrade CR 构建补救计划。您可以在 ClusterGroupUpgrade CR 中定义以下规格:
- 组中的集群
-
阻塞
ClusterGroupUpgradeCR - 适用的受管策略列表
- 并发更新数
- 适用的 Canary 更新
- 更新前和更新之后执行的操作
- 更新数据
您可以使用 ClusterGroupUpgrade CR 中的 enable 字段控制更新的开始时间。例如,如果您调度的维护窗口为 4 小时,您可以准备 ClusterGroupUpgrade CR,并将 enable 字段设置为 false。
您可以通过配置 spec.remediationStrategy.timeout 设置来设置超时,如下所示:
spec
remediationStrategy:
maxConcurrency: 1
timeout: 240
您可以使用 batchTimeoutAction 来确定更新是否有集群发生的情况。您可以指定 continue 跳过失败的集群,并继续升级其他集群,或 abort 以停止所有集群的策略补救。超时后,TALM 删除所有 enforce 策略,以确保对集群不进行进一步的更新。
要应用更改,您可以将 enabled 字段设置为 true。
如需更多信息,请参阅"将更新策略应用到受管集群"部分。
当 TALM 通过对指定集群进行补救时,ClusterGroupUpgrade CR 可以为多个条件报告 true 或 false 状态。
当 TALM 完成集群更新后,集群不会在同一 ClusterGroupUpgrade CR 控制下再次更新。在以下情况下,必须创建新的 ClusterGroupUpgrade CR:
- 当您需要再次更新集群时
-
当集群在更新后变为与
inform策略不符合时
13.5.1. 选择集群
TALM 构建补救计划并根据以下字段选择集群:
-
clusterLabelSelector字段指定您要更新的集群标签。这由来自k8s.io/apimachinery/pkg/apis/meta/v1的标准标签选择器的列表组成。列表中的每个选择器都使用标签值对或标签表达式。来自每个选择器的匹配会添加到集群的最终列表中,以及来自clusterSelector字段和cluster字段的匹配项。 -
clusters字段指定要更新的集群列表。 -
canaries字段指定集群进行 Canary 更新。 -
maxConcurrency字段指定批处理中要更新的集群数量。 -
actions字段指定 TALM 在启动更新过程时执行的beforeEnable操作,以及 TALM 在完成每个集群策略补救时执行的afterCompletion操作。
您可以使用 clusters、clusterLabelSelector 和 clusterSelector 字段来创建组合的集群列表。
补救计划从 canaries 字段中列出的集群开始。每个 canary 集群组成一个集群批处理。
Sample ClusterGroupUpgrade CR,带有 the enabled field 设置为 false
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
creationTimestamp: '2022-11-18T16:27:15Z'
finalizers:
- ran.openshift.io/cleanup-finalizer
generation: 1
name: talm-cgu
namespace: talm-namespace
resourceVersion: '40451823'
uid: cca245a5-4bca-45fa-89c0-aa6af81a596c
Spec:
actions:
afterCompletion: 1
addClusterLabels:
upgrade-done: ""
deleteClusterLabels:
upgrade-running: ""
deleteObjects: true
beforeEnable: 2
addClusterLabels:
upgrade-running: ""
backup: false
clusters: 3
- spoke1
enable: false 4
managedPolicies: 5
- talm-policy
preCaching: false
remediationStrategy: 6
canaries: 7
- spoke1
maxConcurrency: 2 8
timeout: 240
clusterLabelSelectors: 9
- matchExpressions:
- key: label1
operator: In
values:
- value1a
- value1b
batchTimeoutAction: 10
status: 11
computedMaxConcurrency: 2
conditions:
- lastTransitionTime: '2022-11-18T16:27:15Z'
message: All selected clusters are valid
reason: ClusterSelectionCompleted
status: 'True'
type: ClustersSelected 12
- lastTransitionTime: '2022-11-18T16:27:15Z'
message: Completed validation
reason: ValidationCompleted
status: 'True'
type: Validated 13
- lastTransitionTime: '2022-11-18T16:37:16Z'
message: Not enabled
reason: NotEnabled
status: 'False'
type: Progressing
managedPoliciesForUpgrade:
- name: talm-policy
namespace: talm-namespace
managedPoliciesNs:
talm-policy: talm-namespace
remediationPlan:
- - spoke1
- - spoke2
- spoke3
status:
- 1
- 指定 TALM 完成每个集群的策略补救时执行的操作。
- 2
- 指定 TALM 在开始更新过程时执行的操作。
- 3
- 定义要更新的集群列表。
- 4
enable字段设置为false。- 5
- 列出要修复的用户定义的策略集合。
- 6
- 定义集群更新的具体信息。
- 7
- 定义可用于 canary 更新的集群。
- 8
- 定义批处理中的最大并发更新数。补救批处理数量是 canary 集群的数量,加上除 Canary 集群外的集群数量除以
maxConcurrency值。已兼容所有受管策略的集群不包括在补救计划中。 - 9
- 显示选择集群的参数。
- 10
- 控制批处理超时时会发生什么。可能的值有
abort或continue。如果未指定,则默认为continue。 - 11
- 显示更新状态的信息。
- 12
ClustersSelected条件显示所有选择的集群有效。- 13
Validated条件显示所有选择的集群都已验证。
在更新 canary 集群的过程中任何错误都会停止更新过程。
当成功创建补救计划时,您可以将 enable 字段设置为 true,TALM 会开始使用指定的受管策略更新不合规的集群。
只有 ClusterGroupUpgrade CR 的 enable 字段设置为 false 时,才能更改 spec 字段。
13.5.2. 验证
TALM 检查所有指定的受管策略是否可用并正确,并使用 Validated 条件来报告状态和原因:
true验证已完成。
false策略缺失或无效,或者指定了无效的平台镜像。
13.5.3. 预缓存
集群可能具有有限的带宽来访问容器镜像 registry,这可能会在更新完成前造成超时。在单节点 OpenShift 集群中,您可以使用预缓存来避免这种情况。当创建 ClusterGroupUpgrade CR 时,容器镜像预缓存会启动,并将 preCaching 字段设置为 true。TALM 将可用磁盘空间与预计 OpenShift Container Platform 镜像大小进行比较,以确保有足够的空间。如果集群没有足够的空间,TALM 会取消该集群的预缓存,且不会修复其上的策略。
TALM 使用 PrecacheSpecValid 条件来报告状态信息,如下所示:
true预缓存规格有效且一致。
false预缓存规格不完整。
TALM 使用 PrecachingSucceeded 条件来报告状态信息,如下所示:
trueTALM 已完成预缓存过程。如果任何集群的预缓存失败,则该集群的更新会失败,但会继续执行所有其他集群。如果任何集群预缓存失败,您会接收到一个通知信息。
false预缓存仍在为一个或多个集群处理,或者所有集群都失败。
如需更多信息,请参阅"使用容器镜像预缓存功能"部分。
13.5.4. 创建备份
对于单节点 OpenShift,TALM 可以在更新前创建部署的备份。如果更新失败,您可以恢复之前的版本并将集群恢复到工作状态,而无需重新置备应用程序。要使用备份功能,您首先创建一个 ClusterGroupUpgrade CR,并将 backup 字段设置为 true。为确保备份内容为最新版本,在 ClusterGroupUpgrade CR 中的 enable 字段设置为 true 之前,不会进行备份。
TALM 使用 BackupSucceeded 条件来报告状态,如下所示:
true备份对于所有集群都完成,或备份运行已完成但对一个或多个集群失败。如果任何集群的备份失败,则该集群的更新会失败,但会继续执行所有其他集群。
false备份仍在为一个或多个集群处理,或者所有集群都失败。
如需更多信息,请参阅"在升级前创建集群资源备份"部分。
13.5.5. 更新集群
TALM 按照补救计划强制实施策略。在当前批处理的所有集群与所有受管策略兼容后,对后续批处理的策略强制启动。如果批处理超时,TALM 会进入下一个批处理。批处理的超时值是 spec.timeout 字段除以补救计划中的批处理数量。
TALM 使用 Progressing 条件来报告状态以及如下原因:
trueTALM 是补救不合规的策略。
false更新没有进行。可能的原因包括:
- 所有集群都符合所有受管策略。
- 当策略补救用时过长时,更新会超时。
- 阻塞系统中丢失或尚未完成的 CR。
-
ClusterGroupUpgradeCR 不会被启用。 - 备份仍在进行中。
受管策略会按照 ClusterGroupUpgrade CR 中的 managedPolicies 字段中列出的顺序进行应用。一个受管策略被应用于指定的集群。当集群符合当前策略时,会应用下一个受管策略。
处于 Progressing 状态的 ClusterGroupUpgrade CR 示例
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
creationTimestamp: '2022-11-18T16:27:15Z'
finalizers:
- ran.openshift.io/cleanup-finalizer
generation: 1
name: talm-cgu
namespace: talm-namespace
resourceVersion: '40451823'
uid: cca245a5-4bca-45fa-89c0-aa6af81a596c
Spec:
actions:
afterCompletion:
deleteObjects: true
beforeEnable: {}
backup: false
clusters:
- spoke1
enable: true
managedPolicies:
- talm-policy
preCaching: true
remediationStrategy:
canaries:
- spoke1
maxConcurrency: 2
timeout: 240
clusterLabelSelectors:
- matchExpressions:
- key: label1
operator: In
values:
- value1a
- value1b
batchTimeoutAction:
status:
clusters:
- name: spoke1
state: complete
computedMaxConcurrency: 2
conditions:
- lastTransitionTime: '2022-11-18T16:27:15Z'
message: All selected clusters are valid
reason: ClusterSelectionCompleted
status: 'True'
type: ClustersSelected
- lastTransitionTime: '2022-11-18T16:27:15Z'
message: Completed validation
reason: ValidationCompleted
status: 'True'
type: Validated
- lastTransitionTime: '2022-11-18T16:37:16Z'
message: Remediating non-compliant policies
reason: InProgress
status: 'True'
type: Progressing 1
managedPoliciesForUpgrade:
- name: talm-policy
namespace: talm-namespace
managedPoliciesNs:
talm-policy: talm-namespace
remediationPlan:
- - spoke1
- - spoke2
- spoke3
status:
currentBatch: 2
currentBatchRemediationProgress:
spoke2:
state: Completed
spoke3:
policyIndex: 0
state: InProgress
currentBatchStartedAt: '2022-11-18T16:27:16Z'
startedAt: '2022-11-18T16:27:15Z'
- 1
Progressing字段显示 TALM 处于补救策略的过程。
13.5.6. 更新状态
TALM 使用 Succeeded 条件来报告状态和如下原因:
true所有集群都符合指定的受管策略。
false因为没有集群可用于补救,策略补救会失败,或者因为以下原因之一策略补救用时过长:
- 在当前批处理包含 Canary 更新时,批处理中的集群不会遵循批处理超时中的所有受管策略。
-
集群不符合
remediationStrategy字段中指定的timeout值的受管策略。
处于 Succeeded 状态的 ClusterGroupUpgrade CR 示例
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
name: cgu-upgrade-complete
namespace: default
spec:
clusters:
- spoke1
- spoke4
enable: true
managedPolicies:
- policy1-common-cluster-version-policy
- policy2-common-pao-sub-policy
remediationStrategy:
maxConcurrency: 1
timeout: 240
status: 1
clusters:
- name: spoke1
state: complete
- name: spoke4
state: complete
conditions:
- message: All selected clusters are valid
reason: ClusterSelectionCompleted
status: "True"
type: ClustersSelected
- message: Completed validation
reason: ValidationCompleted
status: "True"
type: Validated
- message: All clusters are compliant with all the managed policies
reason: Completed
status: "False"
type: Progressing 2
- message: All clusters are compliant with all the managed policies
reason: Completed
status: "True"
type: Succeeded 3
managedPoliciesForUpgrade:
- name: policy1-common-cluster-version-policy
namespace: default
- name: policy2-common-pao-sub-policy
namespace: default
remediationPlan:
- - spoke1
- - spoke4
status:
completedAt: '2022-11-18T16:27:16Z'
startedAt: '2022-11-18T16:27:15Z'
timedout 状态的 Sample ClusterGroupUpgrade CR
apiVersion: ran.openshift.io/v1alpha1
kind: ClusterGroupUpgrade
metadata:
creationTimestamp: '2022-11-18T16:27:15Z'
finalizers:
- ran.openshift.io/cleanup-finalizer
generation: 1
name: talm-cgu
namespace: talm-namespace
resourceVersion: '40451823'
uid: cca245a5-4bca-45fa-89c0-aa6af81a596c
spec:
actions:
afterCompletion:
deleteObjects: true
beforeEnable: {}
backup: false
clusters:
- spoke1
- spoke2
enable: true
managedPolicies:
- talm-policy
preCaching: false
remediationStrategy:
maxConcurrency: 2
timeout: 240
status:
clusters:
- name: spoke1
state: complete
- currentPolicy: 1
name: talm-policy
status: NonCompliant
name: spoke2
state: timedout
computedMaxConcurrency: 2
conditions:
- lastTransitionTime: '2022-11-18T16:27:15Z'
message: All selected clusters are valid
reason: ClusterSelectionCompleted
status: 'True'
type: ClustersSelected
- lastTransitionTime: '2022-11-18T16:27:15Z'
message: Completed validation
reason: ValidationCompleted
status: 'True'
type: Validated
- lastTransitionTime: '2022-11-18T16:37:16Z'
message: Policy remediation took too long
reason: TimedOut
status: 'False'
type: Progressing
- lastTransitionTime: '2022-11-18T16:37:16Z'
message: Policy remediation took too long
reason: TimedOut
status: 'False'
type: Succeeded 2
managedPoliciesForUpgrade:
- name: talm-policy
namespace: talm-namespace
managedPoliciesNs:
talm-policy: talm-namespace
remediationPlan:
- - spoke1
- spoke2
status:
startedAt: '2022-11-18T16:27:15Z'
completedAt: '2022-11-18T20:27:15Z'
13.5.7. 阻塞 ClusterGroupUpgrade CR
您可以创建多个 ClusterGroupUpgrade CR,并控制应用程序的顺序。
例如,如果您创建了 ClusterGroupUpgrade CR C,它会阻塞 ClusterGroupUpgrade CR A 的启动,那么 ClusterGroupUpgrade CR A 将无法启动,直到 ClusterGroupUpgrade CR C 变为 UpgradeComplete 状态。
一个 ClusterGroupUpgrade CR 可以有多个阻塞 CR。在这种情况下,所有块 CR 都必须在升级当前 CR 升级前完成。
先决条件
- 安装 Topology Aware Lifecycle Manager(TALM)。
- 置备一个或多个受管集群。
-
以具有
cluster-admin特权的用户身份登录。 - 在 hub 集群中创建 RHACM 策略。
流程
将
ClusterGroupUpgradeCR 的内容保存到cgu-a.yaml、cgu-b.yaml和cgu-c.yaml文件中。apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-a namespace: default spec: blockingCRs: 1 - name: cgu-c namespace: default clusters: - spoke1 - spoke2 - spoke3 enable: false managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy remediationStrategy: canaries: - spoke1 maxConcurrency: 2 timeout: 240 status: conditions: - message: The ClusterGroupUpgrade CR is not enabled reason: UpgradeNotStarted status: "False" type: Ready copiedPolicies: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default - name: policy3-common-ptp-sub-policy namespace: default placementBindings: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy placementRules: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy remediationPlan: - - spoke1 - - spoke2- 1
- 定义阻塞 CR。
cgu-a更新无法启动,直到cgu-c完成后。
apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-b namespace: default spec: blockingCRs: 1 - name: cgu-a namespace: default clusters: - spoke4 - spoke5 enable: false managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: conditions: - message: The ClusterGroupUpgrade CR is not enabled reason: UpgradeNotStarted status: "False" type: Ready copiedPolicies: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default - name: policy3-common-ptp-sub-policy namespace: default - name: policy4-common-sriov-sub-policy namespace: default placementBindings: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy placementRules: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy remediationPlan: - - spoke4 - - spoke5 status: {}- 1
cgu-b更新无法启动,直到cgu-a完成后。
apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-c namespace: default spec: 1 clusters: - spoke6 enable: false managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: conditions: - message: The ClusterGroupUpgrade CR is not enabled reason: UpgradeNotStarted status: "False" type: Ready copiedPolicies: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy managedPoliciesCompliantBeforeUpgrade: - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy4-common-sriov-sub-policy namespace: default placementBindings: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy placementRules: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy remediationPlan: - - spoke6 status: {}- 1
cgu-c更新没有任何阻塞 CR。当enable字段设为true时,TALM 会启动cgu-c更新。
通过为每个相关 CR 运行以下命令创建
ClusterGroupUpgradeCR:$ oc apply -f <name>.yaml
通过为每个相关 CR 运行以下命令启动更新过程:
$ oc --namespace=default patch clustergroupupgrade.ran.openshift.io/<name> \ --type merge -p '{"spec":{"enable":true}}'以下示例显示
enable字段设为true的ClusterGroupUpgradeCR:带有阻塞 CR 的
cgu-a示例apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-a namespace: default spec: blockingCRs: - name: cgu-c namespace: default clusters: - spoke1 - spoke2 - spoke3 enable: true managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy remediationStrategy: canaries: - spoke1 maxConcurrency: 2 timeout: 240 status: conditions: - message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet completed: [cgu-c]' 1 reason: UpgradeCannotStart status: "False" type: Ready copiedPolicies: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default - name: policy3-common-ptp-sub-policy namespace: default placementBindings: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy placementRules: - cgu-a-policy1-common-cluster-version-policy - cgu-a-policy2-common-pao-sub-policy - cgu-a-policy3-common-ptp-sub-policy remediationPlan: - - spoke1 - - spoke2 status: {}- 1
- 显示阻塞 CR 的列表。
带有阻塞 CR 的
cgu-b示例apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-b namespace: default spec: blockingCRs: - name: cgu-a namespace: default clusters: - spoke4 - spoke5 enable: true managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: conditions: - message: 'The ClusterGroupUpgrade CR is blocked by other CRs that have not yet completed: [cgu-a]' 1 reason: UpgradeCannotStart status: "False" type: Ready copiedPolicies: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy2-common-pao-sub-policy namespace: default - name: policy3-common-ptp-sub-policy namespace: default - name: policy4-common-sriov-sub-policy namespace: default placementBindings: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy placementRules: - cgu-b-policy1-common-cluster-version-policy - cgu-b-policy2-common-pao-sub-policy - cgu-b-policy3-common-ptp-sub-policy - cgu-b-policy4-common-sriov-sub-policy remediationPlan: - - spoke4 - - spoke5 status: {}- 1
- 显示阻塞 CR 的列表。
带有阻塞 CR 的
cgu-c示例apiVersion: ran.openshift.io/v1alpha1 kind: ClusterGroupUpgrade metadata: name: cgu-c namespace: default spec: clusters: - spoke6 enable: true managedPolicies: - policy1-common-cluster-version-policy - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy - policy4-common-sriov-sub-policy remediationStrategy: maxConcurrency: 1 timeout: 240 status: conditions: - message: The ClusterGroupUpgrade CR has upgrade policies that are still non compliant 1 reason: UpgradeNotCompleted status: "False" type: Ready copiedPolicies: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy managedPoliciesCompliantBeforeUpgrade: - policy2-common-pao-sub-policy - policy3-common-ptp-sub-policy managedPoliciesForUpgrade: - name: policy1-common-cluster-version-policy namespace: default - name: policy4-common-sriov-sub-policy namespace: default placementBindings: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy placementRules: - cgu-c-policy1-common-cluster-version-policy - cgu-c-policy4-common-sriov-sub-policy remediationPlan: - - spoke6 status: currentBatch: 1 remediationPlanForBatch: spoke6: 0- 1
cgu-c更新没有任何阻塞 CR。