OpenShift on OpenStack with Availability Zones: Invalid Compute ServerGroup setup during OpenShift deployment

Solution Verified - Updated

Environment

  • OpenShift on OpenStack IPI
  • The initial cluster deployment version is earlier than 4.14
  • Masters deployed with explicitly given Availability Zones (via zones) in install-config.yaml

Issue

Two of the three masters have an invalid ServerGroup in their Machine ProviderSpec.

Resolution

As part of the bug fix, the installer in 4.14 and later correctly sets the same server group for all masters, regardless of the number of availability zones.

Note that in this environment (multiple Control plane AZs), the proposed solution is strictly incompatible with the “affinity” policy.

For clusters deployed with the IPI method before 4.14, with masters spread across multiple availability zones, you need to manually update the Machine resources so that they reflect the actual state of the instances.

Note: editing the Machine resources will not trigger a rollout of the Control plane instances, because in-place edits of the Machine resources are not acted upon by any OpenShift operator. However, this in-place edit is necessary in order for the cluster-control-plane-machine-set-operator to correctly generate a ControlPlaneMachineSet for your cluster.

To do that, edit the ProviderSpec of both master-1 and master-2, and set the serverGroupName property under spec.providerSpec.value to the value of master-0’s spec.providerSpec.value.serverGroupName:

oc edit machine/<cluster_id>-master-1 -n openshift-machine-api
<make edits>
oc edit machine/<cluster_id>-master-2 -n openshift-machine-api
<make edits>

Here is an example of a providerSpec:

providerSpec:
  value:
    apiVersion: machine.openshift.io/v1alpha1
    availabilityZone: az0
    cloudName: openstack
    cloudsSecret:
      name: openstack-cloud-credentials
      namespace: openshift-machine-api
    flavor: m1.xlarge
    image: rhcos-4.14
    kind: OpenstackProviderSpec
    metadata:
      creationTimestamp: null
    networks:    
    - filter: {}
      subnets:  
      - filter:
          name: refarch-lv7q9-nodes
          tags: openshiftClusterID=refarch-lv7q9
    securityGroups:
    - filter: {}
      name: refarch-lv7q9-master
    serverGroupName: refarch-lv7q9-master-az0 <---- CHANGE ME
    serverMetadata:
      Name: refarch-lv7q9-master
      openshiftClusterID: refarch-lv7q9
    tags:
    - openshiftClusterID=refarch-lv7q9
    trunk: true
    userDataSecret:
      name: master-user-data
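Alternatively, the same change can be scripted with oc patch instead of interactive editing. This is a sketch using the example names from this article; substitute your own infrastructure ID and master-0's actual server group name:

```shell
CLUSTER_ID="refarch-lv7q9"        # example infrastructure ID; use your own
SG="refarch-lv7q9-master-az0"     # master-0's serverGroupName; query it first
# Build a merge patch that only touches serverGroupName, leaving the rest
# of the providerSpec untouched.
PATCH=$(printf '{"spec":{"providerSpec":{"value":{"serverGroupName":"%s"}}}}' "$SG")
echo "$PATCH"
# Apply it to master-1 and master-2 (run these against your cluster):
#   oc patch machine/${CLUSTER_ID}-master-1 -n openshift-machine-api --type merge -p "$PATCH"
#   oc patch machine/${CLUSTER_ID}-master-2 -n openshift-machine-api --type merge -p "$PATCH"
```

A merge patch is used so that only the one property changes; the remaining fields of the providerSpec are left exactly as they were.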

If you edited or recreated your Control Plane Machine resources after installation, you will have to adapt these steps to your situation. In your OpenStack cloud, find the server group your Control plane instances belong to and set it in the serverGroupName property of all three Control Plane Machines.
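To locate that server group, the OpenStack CLI can list groups together with their member instance IDs. This is a sketch; it assumes python-openstackclient is installed and credentials for the cluster's cloud are loaded:

```shell
# List server groups and their members; the group containing your three
# Control plane instances is the value to put in serverGroupName.
# Guarded so the snippet is a no-op on a host with no OpenStack access.
if command -v openstack >/dev/null 2>&1; then
  openstack server group list --long
else
  echo "openstack CLI not available; run this from a host with cloud access"
fi
```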

Once all three Control plane Machine resources have the same correct serverGroupName, your control plane is ready to be managed by the Cluster Control Plane Machine Set Operator, and a ControlPlaneMachineSet (CPMS) will be created.

It is then up to you to review the generated CPMS and, when ready, edit its state to Active:

oc describe controlplanemachineset.machine.openshift.io/cluster --namespace openshift-machine-api
oc edit controlplanemachineset.machine.openshift.io/cluster --namespace openshift-machine-api
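For reference, the field to change is spec.state. Below is an illustrative, abridged fragment of the generated resource (field names per the ControlPlaneMachineSet API; the generated spec also carries the full machine template, elided here):

```yaml
apiVersion: machine.openshift.io/v1
kind: ControlPlaneMachineSet
metadata:
  name: cluster
  namespace: openshift-machine-api
spec:
  replicas: 3
  state: Active    # generated as Inactive; set to Active once reviewed
  ...
```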

Root Cause

If the masters are configured with Availability Zones (AZ), the installer (via Terraform) creates a single ServerGroup in OpenStack (the one for master-0, whose name ends with the AZ name), but configures the Machine ProviderSpecs with different ServerGroup names, one per AZ.

For example: given an install-config.yaml with three zones in the ControlPlane machine-pool, the Installer creates all three Nova instances in the same server group, yet generates each Machine resource with a different value in serverGroupName. The name of the actual server group in OpenStack is the one in the master-0 Machine resource; master-1 and master-2 each carry a bogus value in the serverGroupName property.

This anomaly was reported as OCPBUGS-13300.

Diagnostic Steps

To check whether the masters have different serverGroupName values, run:

oc get -n openshift-machine-api machine -o json | jq -r '.items[] | select(.metadata.labels["machine.openshift.io/cluster-api-machine-role"] == "master") | [.metadata.name, .spec.providerSpec.value.serverGroupName] | join(": ")'
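To illustrate, the same jq filter can be exercised against sample output from an affected cluster (a hypothetical file, using the example names from this article), counting how many distinct server group names the masters carry:

```shell
# Sample `oc get machine -o json` output from an affected pre-4.14 cluster,
# reduced to the fields the filter reads (hypothetical names).
cat > /tmp/machines.json <<'EOF'
{"items":[
 {"metadata":{"name":"refarch-lv7q9-master-0","labels":{"machine.openshift.io/cluster-api-machine-role":"master"}},
  "spec":{"providerSpec":{"value":{"serverGroupName":"refarch-lv7q9-master-az0"}}}},
 {"metadata":{"name":"refarch-lv7q9-master-1","labels":{"machine.openshift.io/cluster-api-machine-role":"master"}},
  "spec":{"providerSpec":{"value":{"serverGroupName":"refarch-lv7q9-master-az1"}}}},
 {"metadata":{"name":"refarch-lv7q9-master-2","labels":{"machine.openshift.io/cluster-api-machine-role":"master"}},
  "spec":{"providerSpec":{"value":{"serverGroupName":"refarch-lv7q9-master-az2"}}}}
]}
EOF
# Count distinct serverGroupName values among the masters; anything other
# than 1 means the Machine resources disagree and need the fix above.
DISTINCT=$(jq -r '[.items[]
    | select(.metadata.labels["machine.openshift.io/cluster-api-machine-role"]=="master")
    | .spec.providerSpec.value.serverGroupName] | unique | length' /tmp/machines.json)
echo "$DISTINCT"   # prints 3 for this sample: broken; 1 means consistent
```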

If the names are not identical (typical of an IPI cluster deployed before 4.14 with masters spread across Availability Zones, then upgraded to OCP 4.14), you will see the following error in the control-plane-machine-set-operator logs in the openshift-machine-api namespace:

controller.go:329  "msg"="Reconciler error" "error"="error reconciling control plane machine set: unable to generate control plane machine set: unable to generate control plane machine set spec: failed to check OpenStack machines ServerGroup: machine refarch-lv7q9-master-1 has a different ServerGroup than the newest machine" 

At this stage, no ControlPlaneMachineSet (CPMS) has been created, but the operator reports as healthy and ready.

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
