Orphaned Instances not owned by machineset

Solution In Progress

Environment

So far, this issue has only been observed on OpenShift clusters hosted on AWS.

Issue

The openshift-machine-api can create instances in AWS but then fail to verify them. As a result, it keeps requesting further instances until one can be verified and associated with a machine and machineset in the cluster, leaving the unverified instances orphaned.

Resolution

Identify and remove orphaned instances.

Root Cause

The root cause is still being verified in https://bugzilla.redhat.com/show_bug.cgi?id=2025767.

Diagnostic Steps

The following Prometheus query can indicate pending CSRs that may belong to orphaned instances:

sum(mapi_current_pending_csr) > 1

The threshold of > 1 is used because of a metric reporting issue explained in https://access.redhat.com/solutions/6411541

If the above query returns a value greater than 1, compare the current count of machines per machineset in the cluster with the number of instances in the cluster's VPC.

Get a list of all VMs in the AWS account whose name matches that of the machineset:

aws ec2 describe-instances --filters "Name=tag:Name,Values=${MACHINESETNAME}*" | jq -r '.Reservations[].Instances[].Tags[] | select(.Key=="Name") | .Value'

Get a list of all machines belonging to that machineset in the cluster:

oc get machines -n openshift-machine-api -l "machine.openshift.io/cluster-api-machineset"="$MACHINESETNAME"

Compare the two lists to find VMs that do not have a matching machine name in the cluster.
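
The comparison can be scripted. The following is a minimal sketch, not part of the original solution: find_orphans is a hypothetical helper name, and it assumes each list has been saved to a file with one name per line and that the EC2 Name tag of an instance is the same string as the machine name in the cluster.

```shell
# Hypothetical helper (not from the original article).
# Prints every name found in the first file that is absent from the second.
#   $1: file with EC2 Name tag values, one per line
#   $2: file with machine names from the cluster, one per line
find_orphans() {
  sort -u "$1" | awk -v machines="$2" '
    BEGIN {
      # Load the known machine names into a lookup table.
      while ((getline line < machines) > 0) known[line] = 1
    }
    # Print only names that have no matching machine.
    !($0 in known)
  '
}

# Possible usage against a live cluster (assumes MACHINESETNAME is set):
#   aws ec2 describe-instances --filters "Name=tag:Name,Values=${MACHINESETNAME}*" \
#     | jq -r '.Reservations[].Instances[].Tags[] | select(.Key=="Name") | .Value' > aws_names.txt
#   oc get machines -n openshift-machine-api \
#     -l "machine.openshift.io/cluster-api-machineset=$MACHINESETNAME" \
#     -o jsonpath='{range .items[*]}{.metadata.name}{"\n"}{end}' > machine_names.txt
#   find_orphans aws_names.txt machine_names.txt
```

Any name printed is only a candidate. Confirm that the instance is orphaned before removing it.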

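Pending CSRs can also identify orphaned instances. Run the following command to list the node names requested in the CSRs; as in the output below, these names embed the private IP address of each instance.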

$ oc get csr -o json  | jq '.items[].spec.request' -r|while read req; do base64 -d<<<$req | openssl req -text; done | grep "CN =" | sort | uniq
        Subject: O = system:nodes, CN = system:node:ip-10-112-61-167.ec2.internal
        Subject: O = system:nodes, CN = system:node:ip-10-112-61-177.ec2.internal
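
To turn these subject lines into IP addresses, something like the following sketch may help. csr_cn_to_ip is a hypothetical helper name, and the pattern assumes the node names follow the ip-A-B-C-D form shown in the output above.

```shell
# Hypothetical helper (not from the original article).
# Reads "Subject: ..." lines on stdin and prints the private IP address
# encoded in each node name, e.g. ip-10-112-61-167 becomes 10.112.61.167.
# Lines that do not match the ip-A-B-C-D pattern are skipped.
csr_cn_to_ip() {
  sed -n 's/.*CN *= *system:node:ip-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)-\([0-9]\{1,3\}\)\..*/\1.\2.\3.\4/p'
}

# Possible usage: pipe the output of the CSR command above into the helper,
# then look up each address in AWS (IP is a placeholder variable):
#   aws ec2 describe-instances --filters "Name=private-ip-address,Values=$IP" \
#     --query 'Reservations[].Instances[].InstanceId' --output text
```

The same private IP address may exist in more than one VPC of the account, so check that a matching instance belongs to the cluster's VPC.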

These instances can be deleted after confirming they are orphaned. First, look up the instance ID from the VM name:

aws ec2 describe-instances --filters "Name=tag:Name,Values=$VM_NAME" | jq -r '.Reservations[].Instances[].InstanceId'

Verify the instance:

aws ec2 describe-instances --instance-ids "$INSTANCE_ID"

Then terminate the instance:

aws ec2 terminate-instances --instance-ids "$INSTANCE_ID"

Once all the instances have been terminated, you can also remove any pending CSRs to resolve the alert:

oc get csr -o go-template='{{range .items}}{{if not .status}}{{.metadata.name}}{{"\n"}}{{end}}{{end}}' | xargs -r oc delete csr --ignore-not-found --wait=false

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
