So far this has only been observed on OpenShift clusters hosted on AWS.


The openshift-machine-api can create instances in AWS that then fail verification. It then requests further instances until one can be verified and associated with a machine and machineset in the cluster, leaving the unverified instances orphaned.


Identify and remove orphaned instances.

Root Cause

The root cause is still being verified in https://bugzilla.redhat.com/show_bug.cgi?id=2025767.

Diagnostic Steps

If the Prometheus query below returns a result, pending CSRs may belong to orphaned instances.

sum(mapi_current_pending_csr) > 1

The threshold of > 1 is used because of a metric reporting issue explained in https://access.redhat.com/solutions/6411541

If the above query returns a value greater than 1, compare the current count of machines per machineset in the cluster with the number of instances in the cluster's VPC.

Get a list of all VMs in the AWS account whose name matches that of the machineset:

aws ec2 describe-instances --filters "Name=tag:Name,Values=${MACHINESETNAME}*" | jq -r '.Reservations[].Instances[].Tags[] | select(.Key=="Name") | .Value'

Get a list of all machines belonging to that machineset in the cluster:

oc get machines -n openshift-machine-api -l machine.openshift.io/cluster-api-machineset="$MACHINESETNAME"

Compare the two lists to find VMs that do not have a matching machine name in the cluster.
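
The comparison can be sketched with standard shell tools, assuming the output of the two commands above has been saved to files with one name per line. The function name find_orphans and the file names are hypothetical, and the sketch assumes that the EC2 Name tag of an instance equals the name of its machine.

```shell
# find_orphans INSTANCES_FILE MACHINES_FILE   (hypothetical helper)
# Prints every name listed in INSTANCES_FILE that does not appear
# as an exact, whole-line match in MACHINES_FILE.
find_orphans() {
  # -F fixed strings, -x whole-line match, -v invert, -f read patterns from file
  grep -Fxv -f "$2" "$1" | sort -u
}

# Example usage (file names are assumptions):
#   aws ec2 describe-instances ... | jq -r '...' > instances.txt
#   oc get machines -n openshift-machine-api \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > machines.txt
#   find_orphans instances.txt machines.txt
```

Any name printed is only a candidate. Confirm that the instance really has no corresponding machine before acting on it.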

Pending CSRs can also identify orphaned instances. Run the command below to identify their IP addresses, which are embedded in the node names.

$ oc get csr -o json  | jq '.items[].spec.request' -r|while read req; do base64 -d<<<$req | openssl req -text; done | grep "CN =" | sort | uniq
        Subject: O = system:nodes, CN = system:node:ip-10-112-61-167.ec2.internal
        Subject: O = system:nodes, CN = system:node:ip-10-112-61-177.ec2.internal
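
As a minimal sketch, the node names in this output can be turned into plain IP addresses with sed. The function name node_cn_to_ip is hypothetical, and the pattern assumes the default ip-A-B-C-D hostname form shown in the sample output above.

```shell
# node_cn_to_ip   (hypothetical helper)
# Reads lines containing "system:node:ip-A-B-C-D.<domain>" on stdin
# and prints A.B.C.D for each line that matches; other lines are dropped.
node_cn_to_ip() {
  sed -n 's/.*system:node:ip-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)-\([0-9]*\)\..*/\1.\2.\3.\4/p'
}

# Example usage: append "| node_cn_to_ip" to the CSR pipeline above.
# Each address can then be matched to an instance, for example with
# the private-ip-address filter of describe-instances:
#   aws ec2 describe-instances --filters "Name=private-ip-address,Values=$IP"
```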

Once you have confirmed that these instances are orphaned, they can be deleted. First, look up the instance ID from the VM name:

aws ec2 describe-instances --filters "Name=tag:Name,Values=$VM_NAME" | jq -r '.Reservations[].Instances[].InstanceId'

Verify the instance:

aws ec2 describe-instances --instance-ids $INSTANCE_ID

Then terminate the instance:

aws ec2 terminate-instances --instance-ids $INSTANCE_ID

Once all the orphaned instances have been terminated, you can also remove any pending CSRs to resolve the alert:

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs -r oc delete csr --ignore-not-found --wait=false

