Unable to Delete a Project or Namespace in OCP

Environment

  • Red Hat OpenShift Container Platform (RHOCP, OCP)
    • 3.11
    • 4.4+

Issue

  • oc delete project dev --force --grace-period=0 does not completely delete a project
  • "I am unable to delete a project"
  • The project is stuck in "Terminating" state after deletion

Resolution

First, try to troubleshoot and delete the remaining resources.
Do not force removals unless you fully understand the consequences.

Troubleshoot and delete remaining resources

This usually happens because something is preventing a resource from being deleted, which leaves the namespace deletion stuck. It is necessary to troubleshoot which resources are failing to be deleted and why.

A good troubleshooting approach would be:

  • Check the output of the command oc api-resources. If it fails, check Projects stuck in Terminating state and unable to run "oc api-resources" on OpenShift.
  • Try to list all the items in the namespace with the following command:

    oc api-resources --verbs=list --namespaced -o name | xargs -t -n 1 oc get --show-kind --ignore-not-found -n $PROJECT_NAME
    
  • If the previous command fails, try this one instead; it might not return a complete list, but it is less likely to fail:

    oc api-resources --verbs=list --cached --namespaced -o name | xargs -t -n 1 oc get --show-kind --ignore-not-found -n $PROJECT_NAME
    
  • Try manually removing every listed resource and, if one fails, troubleshoot why.

  • If all the listed resources are removed, try listing the resources in the namespace directly from etcd. If resources are still present there, troubleshoot why.

    • In OCP 4, you can do it this way:

      [user@workstation ~]$ POD=$(oc get pods -n openshift-etcd -o=jsonpath='{.items[0].metadata.name}')
      [user@workstation ~]$ oc rsh -n openshift-etcd -c etcdctl $POD
      sh-4.2# etcdctl get --keys-only --from-key / | grep $PROJECT_NAME
      
    • In OCP 3, you can do it in a master this way:

      etcdctl3 get --keys-only --from-key / | grep $PROJECT_NAME
      
  • A good place to start troubleshooting is the master controller logs. If the resource is a CRD managed by an operator, troubleshoot that operator.
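
The namespace's status conditions usually point to the failing step (for example, NamespaceDeletionContentFailure names the resource types that could not be deleted). The following is a minimal sketch, assuming jq is available; the helper function name is my own, not part of oc:

```shell
# filter_failing_conditions: reads a Namespace JSON document on stdin and
# prints every deletion condition whose status is "True", with its message.
filter_failing_conditions() {
  jq -r '.status.conditions[]? | select(.status == "True") | "\(.type): \(.message)"'
}

# Typical use against a live cluster (illustrative):
#   oc get namespace "$PROJECT_NAME" -o json | filter_failing_conditions
```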

Important: In case of any issue, please open a support case to get assistance.

In many cases, deleting the resources that could not be deleted in the first place ultimately allows the project to leave the "Terminating" state and be properly removed, but not always, so you may need to force its removal. It is also possible that the project is waiting on the removal of an object with its own finalizer. If that is the case, and you are absolutely sure of what you are doing, you can simply remove the finalizer for that object. Both procedures are covered below; please use them with caution and only if you know what you are doing.

Force individual object removal when it has finalizers - USE WITH CAUTION

Sometimes a resource (especially a custom resource managed by an operator) may remain in "Terminating" state waiting on a finalizer, even though any needed cleanup tasks have already been completed, so it becomes necessary to force its removal.

However, a very important warning: forcing the removal of an object without having properly cleaned it up may lead to unstable and unpredictable behavior, so you must be absolutely sure this is not the case, and open a support case if you have even the slightest doubt. The impact depends on the operator and the affected object, but it can be high.

Only if you know what you are doing, and you are absolutely sure that any cleanup tasks for the object have been properly completed but you still need to force its removal, can you do it this way:

$ oc patch -n <project-name> <object-kind>/<object-name> --type=merge -p '{"metadata": {"finalizers":null}}'
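
Before patching, it can help to confirm which finalizers the object actually carries. A minimal sketch, assuming jq is available (the helper function name is hypothetical):

```shell
# print_finalizers: reads an object's JSON on stdin and prints each
# metadata.finalizers entry on its own line (no output if there are none).
print_finalizers() {
  jq -r '.metadata.finalizers[]?'
}

# Typical use against a live cluster (illustrative):
#   oc get -n <project-name> <object-kind>/<object-name> -o json | print_finalizers
```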

Force namespace removal - USE WITH CAUTION

Sometimes, when a project has been stuck in "Terminating" state, the namespace may remain stuck in that state forever even after all the resources have been properly removed, so it becomes necessary to force its removal.

However, a very important warning: forcing the removal of a namespace without having properly cleaned it up may lead to unstable and unpredictable cluster behavior, so you must be absolutely sure this is not the case, and open a support case if you have even the slightest doubt.

Only if you know what you are doing, and you are absolutely sure that you have properly cleaned up every resource in the namespace but still need to force its removal, can you follow these steps:

  • Confirm which namespace needs to be removed with oc get namespace
  • Create a temporary .json file: oc get namespace <failing namespace> -o json > tmp.json
  • Edit the file with your favorite text editor, for example: vim tmp.json
    • Remove the kubernetes value from the finalizers field and save the file.
  • Your tmp.json file should look similar to this:

    {
      "apiVersion": "v1",
      "kind": "Namespace",
      "metadata": {
          "annotations": {
              "openshift.io/description": "",
              "openshift.io/display-name": "",
              "openshift.io/requester": "system:admin",
              "openshift.io/sa.scc.mcs": "s0:c16,c15",
              "openshift.io/sa.scc.supplemental-groups": "1000270000/10000",
              "openshift.io/sa.scc.uid-range": "1000270000/10000"
          },
          "creationTimestamp": "2020-04-27T08:35:29Z",
          "deletionTimestamp": "2020-04-27T09:07:22Z",
          "name": "test",
          "resourceVersion": "3480943",
          "selfLink": "/api/v1/namespaces/test",
          "uid": "0d2d425c-8862-11ea-bce9-fa163eb0b490"
      },
      "spec": {
          "finalizers": []
      },
      "status": {
          "phase": "Terminating"
      }
    }
    
  • Set up a temporary proxy. Keep this terminal open until the namespace is deleted.

    $ oc proxy
    
  • In a new terminal window, set ${PROJECT_NAME} to the name of the failing project/namespace and run the following:

    $ curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/${PROJECT_NAME}/finalize
    
  • If you get authorization errors, you can also try running this on a master, using certificates and skipping the proxy command above:

    # curl --cacert /etc/origin/master/ca.crt --key /etc/origin/master/admin.key --cert /etc/origin/master/admin.crt -k  -H "Content-Type: application/json" -X PUT --data-binary @tmp.json https://127.0.0.1:8443/api/v1/namespaces/<terminating-namespace>/finalize
    
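As an alternative to editing tmp.json by hand, the same change can be scripted. A sketch assuming jq is available (the helper function name is my own):

```shell
# strip_kubernetes_finalizer: reads the namespace JSON on stdin and removes
# the "kubernetes" entry from spec.finalizers, leaving everything else intact.
strip_kubernetes_finalizer() {
  jq '.spec.finalizers -= ["kubernetes"]'
}

# Produce the edited file for the finalize call (illustrative):
#   strip_kubernetes_finalizer < tmp.json > tmp-edited.json
```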

Root Cause

Every Kubernetes namespace has a kubernetes finalizer that prevents its final deletion when a delete is requested on that namespace. The reason is that the masters can then delete the resources in the namespace before deleting the namespace itself.

However, many different reasons can lead to some of these resources not being properly deleted. A typical example is the failure of an external apiservice (like the service catalog).

This solution provides general guidance on how to troubleshoot this kind of situation, as well as procedures to force the deletion of a namespace or of individual objects that can also be stuck on finalizers. Those procedures must be used only as a last resort.

Diagnostic Steps

To double-check whether a project is stuck "Terminating", you can do the following:

First, attempt running:
$ oc delete project <project name> --force --grace-period=0

If this returns output like the following:

warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (Conflict): Operation cannot be fulfilled on namespaces "<project>": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.

Then run:
$ oc get project <project name> -o yaml

If you see the following at the bottom of the output:

spec:
  finalizers:
  - kubernetes
status:
  phase: Terminating

The 'kubernetes' finalizer is keeping the project from deletion. You need to troubleshoot why and fix it as described in the Resolution section.
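
The same check can be scripted. A minimal sketch, assuming jq is available (the helper function name is hypothetical):

```shell
# is_stuck_terminating: reads a Project/Namespace JSON document on stdin and
# exits 0 only when the object is in phase Terminating AND still carries the
# "kubernetes" finalizer in spec.finalizers.
is_stuck_terminating() {
  jq -e '.status.phase == "Terminating"
         and ((.spec.finalizers // []) | index("kubernetes") != null)' > /dev/null
}

# Typical use against a live cluster (illustrative):
#   oc get project <project name> -o json | is_stuck_terminating && echo stuck
```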

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.

16 Comments

Getting error:

:namespaces \namespace\ is forbidden: User \"system:anonymous\" cannot update namespaces/finalize in the namespace \"jenkins\": no RBAC policy attached.

But when the oc whoami command was run, it returned:

cluster-admin

and then I attempted from the system:admin account

Me too, that's why this solution is unverified...

Isiaha , Ekky

That error means that the kube-apiserver is not acknowledging the token or client certificate your kubeconfig is using to prove your identity, so it treats your oc client as an anonymous user.

Running "oc whoami" and getting a "cluster-admin" answer means that your user's name is "cluster-admin".

If it failed for both the "cluster-admin" and "system:admin" users, then the kubeconfig might be wrong.

IF you

this cmd was most useful:

oc get project $PROJECT_NAME -o yaml

fyi, I had this due to stuck rolebindings.authorization.openshift.io / rolebinding.rbac.authorization.k8s.io.

due to kubernetes.io/iam.security.ibm.com it seems. https://bugzilla.redhat.com/show_bug.cgi?id=1932096 https://www.ibm.com/docs/en/cloud-paks/cp-applications/4.3?topic=troubleshooting-installation-issues#reinstallfails

Same here. This command showed that there was a RoleBinding preventing the project from being removed, but deleting it, even forced, didn't work. What did work, as explained, was to set the finalizer to [] followed by a delete --force --grace-period=0 on the RoleBinding, which removed it from the project, and the project disappeared afterwards. There was a bug filed for almost this exact situation, which was closed as not a bug.

https://bugzilla.redhat.com/show_bug.cgi?id=1932096

Thanks for the excellent article!

No solution worked in the RH opentlc lab cee-cf-111.

A solution from RH cannot end with a suggestion that leaves the user to troubleshoot on their own. The user did troubleshoot, and it didn't work, which is why the issue was raised to RH.

The article does not help much. I have the issue on an OKD 4.9 cluster (vSphere UPI). Looking at the resources of one of the projects stuck in Terminating state gives:

Error from server (InternalError): Internal error occurred: error resolving resource

followed by a long list of packagemanifest.packages.operators.co manifest names from the operator catalog. No indication of the resource causing the error.

Looking at etcd, there are no resources associated with the project. Project status includes:

status:
  conditions:
  - lastTransitionTime: "2022-05-04T18:12:35Z"
    message: All resources successfully discovered
    reason: ResourcesDiscovered
    status: "False"
    type: NamespaceDeletionDiscoveryFailure
  - lastTransitionTime: "2021-12-22T00:02:14Z"
    message: All legacy kube types successfully parsed
    reason: ParsedGroupVersions
    status: "False"
    type: NamespaceDeletionGroupVersionParsingFailure
  - lastTransitionTime: "2021-12-22T00:02:14Z"
    message: 'Failed to delete all resource types, 1 remaining: Internal error occurred:
      error resolving resource'
    reason: ContentDeletionFailed
    status: "True"
    type: NamespaceDeletionContentFailure
  - lastTransitionTime: "2021-12-22T00:02:19Z"
    message: All content successfully removed
    reason: ContentRemoved
    status: "False"
    type: NamespaceContentRemaining
  - lastTransitionTime: "2021-12-22T00:02:15Z"
    message: All content-preserving finalizers finished
    reason: ContentHasNoFinalizers
    status: "False"
    type: NamespaceFinalizersRemaining
  phase: Terminating

The only suggestions left in this article are labeled USE WITH CAUTION.

Also the code snippet using etcd has an undefined environment variable ETCDCTL_COMMAND

The knowledge article needs more information on debugging and safely resolving the issue.

Hello,

The ETCDCTL_COMMAND thingy was a typo that has been fixed.

However, you seem to have a problem with an aggregated API server. As this may be a complex problem and OKD is not supported by Red Hat, I'd suggest you try to seek help from the OKD community: https://www.okd.io/help/

If you reproduce this on a supported Red Hat OpenShift Container Platform cluster, please open a support case for that cluster.

RE "OKD is not supported by Red Hat," yes, that is what we tell people.

As indicated in my post above, "oc get " returns an internal error resolving resource and does not name the offending resource (OCP bug?).

A helpful addition to this knowledge base solution might be to replace that oc get with a script that echoes the resource name, then calls oc get. This is what allowed me to track down the problem to a bad certificate field in a CRD.

Echoing the resource name for each invocation is a good idea, so I implemented it in a simpler way: by adding -t to the xargs invocations, so that the invoked command lines are printed to stderr.

Regarding the internal error, that is likely to be a bug that may need to be triaged. So, if found in OKD, feel free to open an issue if you see you have the right information to do so. If found in RHOCP with a proper subscription, please open a support case.

Your solution is better than mine was, thanks,

OKD issue: https://github.com/openshift/okd/issues/1222

OCP issue: https://bugzilla.redhat.com/show_bug.cgi?id=2084960

As you indicate, there is some underlying problem that is likely more serious than the "oc get" issue. Hopefully someone will follow it further.

If you can reproduce this issue with Enterprise RHOCP and you have proper subscriptions, please open a support case. This kind of issue must first be triaged by Red Hat Support before opening a Bugzilla against the product, as it may be caused by a problem in your cluster rather than a bug in the product.

If you don't have an account with Red Hat and/or proper subscriptions for Red Hat OpenShift Container Platform, you would need then to stick to the OKD issue.

No.

I am following the proper procedures, as you will see by looking at the OKD issue and the OCP issue (linked above).

I am just trying to help get rid of one of the causes of projects/namespaces hanging in terminating mode (due in this case to an internal error when checking the project's resources).

However, as I can reproduce the issue on Minikube, it seems to be upstream from OCP.

I see there is some misconception about what this article is and what its goal is.

The goal of this article is to:

  • Give tips on how to troubleshoot a project stuck in "Terminating" state, usually caused by one or more resources stuck in a terminating state: the reasons why this can happen are extremely varied, so only general guidelines can be provided. It is recommended to open a support case with Red Hat (or to seek help from the community in the case of OKD) if this is not enough.
  • Document how to delete an object stuck at a finalizer, which must be done with much caution: this is something that requires extremely advanced OCP knowledge, not because the procedure is complex, but because you need to be very sure you understand what you are doing and, more importantly, whether it is safe to let the resource be deleted from the API (which in most cases requires deep and specific knowledge about the stuck resource and whatever created or manages it).
  • Document how to delete a namespace stuck in a finalizing state, which must also be done with much caution: as in the previous point, this requires extremely advanced OCP knowledge, not because the procedure is complex, but because you need to be very sure you understand what you are doing and, more importantly, that all the resources related to the namespace have been safely removed without leaving anything behind.

The "USE WITH CAUTION" warnings are there for a reason. Just as an example, as early versions of this article did not document these risks properly, I saw somebody forcing the removal of a project without all the services having been removed, so there were services in the API for a non-existing project, which in turn caused the OpenShift SDN to start misbehaving in a horrible manner that caused great cluster-wide impact.

This article is not a direct and easy "I have a problem, I apply some steps, I get a solution" recipe, and it doesn't intend to be. There may be concrete situations causing stuck objects that can be identified in such a simple manner, and each of them would deserve its own smaller solution. This is a generic reference on how to apply these "for very extreme emergencies only" advanced procedures, along with some generic guidelines on what to do before trying them.

With ServiceMesh, there are 2 resources that cannot be terminated using oc patch -n <project-name> <object-kind>/<object-name> --type=merge -p '{"metadata": {"finalizers":null}}':

$ oc patch -n istio-system servicemeshcontrolplane.maistra.io/service-mesh-installation --type=merge -p '{"metadata": {"finalizers":null}}'
Error from server (InternalError): Internal error occurred: failed calling webhook "smcp.mutation.maistra.io": Post "https://maistra-admission-controller.openshift-operators.svc:443/mutate-smcp?timeout=10s": no endpoints available for service "maistra-admission-controller"

$ oc get servicemeshcontrolplane.maistra.io/service-mesh-installation
NAME                        READY   STATUS            PROFILES      VERSION   AGE
service-mesh-installation   9/9     ComponentsReady   ["default"]   2.0.8     198d

$ oc delete servicemeshcontrolplane.maistra.io/service-mesh-installation
servicemeshcontrolplane.maistra.io "service-mesh-installation" deleted

The delete also hangs.