Unable to Delete a Project or Namespace in OCP
Environment
- Red Hat OpenShift Container Platform (RHOCP, OCP)
- 3.11
- 4.4+
Issue
- Running oc delete project dev --force --grace-period=0 does not completely delete the project: "I am unable to delete a project"
- The project is stuck in the "Terminating" state after deletion
Resolution
Try to troubleshoot and delete the remaining resources. Do not force removals unless you know what you are doing.
Troubleshoot and delete remaining resources
This usually happens because something is preventing a resource from being deleted, causing the namespace deletion to get stuck. It is necessary to troubleshoot which resources are failing to be deleted and why.
A good troubleshooting approach would be:
- Check the output of the command oc api-resources. If it fails, check "Projects stuck in Terminating state and unable to run oc api-resources on OpenShift".
- Try to list all the items in the namespace with the following command:
oc api-resources --verbs=list --namespaced -o name | xargs -t -n 1 oc get --show-kind --ignore-not-found -n $PROJECT_NAME
- If the previous command fails, try this one, which might not return a complete list but is less likely to fail:
oc api-resources --verbs=list --cached --namespaced -o name | xargs -t -n 1 oc get --show-kind --ignore-not-found -n $PROJECT_NAME
- Try manually removing every listed resource and, if one fails, troubleshoot why.
- If all the listed resources are removed, try listing the resources left in the namespace directly in etcd. If resources are still present there, troubleshoot why.
In OCP 4, you can do it this way:
[user@workstation ~]$ POD=`oc get pods -n openshift-etcd -o=jsonpath='{.items[0].metadata.name}'`
[user@workstation ~]$ oc rsh -n openshift-etcd -c etcdctl $POD
sh-4.2# etcdctl get --keys-only --from-key / | grep $PROJECT_NAME
In OCP 3, you can do it on a master this way:
etcdctl3 get --keys-only --from-key / | grep $PROJECT_NAME
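The listing steps above can be sketched as a small shell helper that prints each resource kind before querying it, so the resource that makes "oc get" fail is easy to spot. This is a sketch only; the function name and the "###" marker are illustrative, not part of the product, and it assumes oc is logged in to the cluster.

```shell
# list_stuck_resources <namespace>: print each namespaced API resource kind,
# then list its instances, so a failing "oc get" call stands out in the output.
list_stuck_resources() {
  ns="$1"
  for res in $(oc api-resources --verbs=list --namespaced -o name); do
    echo "### ${res}"
    oc get "${res}" --show-kind --ignore-not-found -n "${ns}"
  done
}
```

Usage: list_stuck_resources dev (assuming "dev" is the stuck project).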
A good place to start troubleshooting is the master controller logs. If the resource is a CRD managed by an operator, troubleshoot that operator.
Important: In case of any issue, please open a support case to get assistance.
In many cases, deleting the resources that could not be deleted in the first place ultimately unsticks the project from "Terminating" so it is properly removed, but not always, so you may need to force its removal. It is also possible that the project is waiting on the removal of an object with its own finalizer. If that is the case, and you are 100% sure of what you are doing, you can remove the finalizer for that object. Both procedures are covered below; use them with caution and only if you know what you are doing.
Force individual object removal when it has finalizers - USE WITH CAUTION
Sometimes, a resource (especially a custom resource managed by an operator) may stay "terminating" while waiting on a finalizer, even though all needed cleanup tasks have already been completed, so it becomes necessary to force its removal.
However, a very important warning: forcing the removal of an object without having properly cleaned it up may lead to unstable and unpredictable behavior, so you must be 100% sure this is not the case, and open a support case if you have even the slightest doubt. The impact depends on the operator and the affected object, but it can potentially be high.
Only if you know what you are doing and are 100% sure that all cleanup tasks for the object have been completed, but you still need to force its removal, you can do it this way:
$ oc patch -n <project-name> <object-kind>/<object-name> --type=merge -p '{"metadata": {"finalizers":null}}'
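Before patching, it is worth confirming which finalizers the object actually carries. A minimal sketch (the helper name is hypothetical; it only runs the patch when you explicitly set FORCE=yes):

```shell
# show_and_clear_finalizers <namespace> <kind>/<name>:
# print the object's current finalizers; null them only when FORCE=yes.
show_and_clear_finalizers() {
  ns="$1"; obj="$2"
  echo "finalizers: $(oc get -n "$ns" "$obj" -o jsonpath='{.metadata.finalizers}')"
  if [ "${FORCE:-no}" = "yes" ]; then
    oc patch -n "$ns" "$obj" --type=merge -p '{"metadata": {"finalizers":null}}'
  fi
}
```

Usage: FORCE=yes show_and_clear_finalizers <project-name> <object-kind>/<object-name>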
Force namespace removal - USE WITH CAUTION
Sometimes, when a project has been stuck in the "Terminating" state, the namespace may remain stuck in that state forever even after all the resources have been properly removed, so it becomes necessary to force its removal.
However, a very important warning: forcing the removal of a namespace without having properly cleaned it up may lead to unstable and unpredictable cluster behavior, so you must be 100% sure this is not the case, and open a support case if you have even the slightest doubt.
Only if you know what you are doing and are 100% sure that you have properly cleaned up every resource from the namespace but still need to force its removal, you can follow these steps to do so:
- Confirm which namespace needs to be removed with:
oc get namespace
- Create a temporary .json file:
oc get namespace <failing namespace> -o json > tmp.json
- Edit the file with your favorite text editor:
vim tmp.json
- Remove the kubernetes value from the finalizers field and save the file.
- Your tmp.json file should look similar to this:
{
  "apiVersion": "v1",
  "kind": "Namespace",
  "metadata": {
    "annotations": {
      "openshift.io/description": "",
      "openshift.io/display-name": "",
      "openshift.io/requester": "system:admin",
      "openshift.io/sa.scc.mcs": "s0:c16,c15",
      "openshift.io/sa.scc.supplemental-groups": "1000270000/10000",
      "openshift.io/sa.scc.uid-range": "1000270000/10000"
    },
    "creationTimestamp": "2020-04-27T08:35:29Z",
    "deletionTimestamp": "2020-04-27T09:07:22Z",
    "name": "test",
    "resourceVersion": "3480943",
    "selfLink": "/api/v1/namespaces/test",
    "uid": "0d2d425c-8862-11ea-bce9-fa163eb0b490"
  },
  "spec": {
    "finalizers": []
  },
  "status": {
    "phase": "Terminating"
  }
}
-
- Set up a temporary proxy, and keep this terminal open until the namespace is deleted:
$ oc proxy
- In a new terminal window, replace ${PROJECT_NAME} with the name of the failing project/namespace and enter the following:
$ curl -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json http://127.0.0.1:8001/api/v1/namespaces/${PROJECT_NAME}/finalize
- If you get authorization errors, you can also try running this on a master, using certificates and without the proxy command above:
# curl --cacert /etc/origin/master/ca.crt --key /etc/origin/master/admin.key --cert /etc/origin/master/admin.crt -k -H "Content-Type: application/json" -X PUT --data-binary @tmp.json https://127.0.0.1:8443/api/v1/namespaces/<terminating-namespace>/finalize
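As an alternative to hand-editing tmp.json, the finalizers field can be cleared programmatically before the finalize call. A sketch assuming python3 is available on the workstation (the function name is illustrative):

```shell
# strip_finalizers <file>: rewrite the exported namespace JSON with
# spec.finalizers emptied, same effect as the manual vim edit above.
strip_finalizers() {
  python3 - "$1" <<'EOF'
import json, sys

path = sys.argv[1]
with open(path) as f:
    ns = json.load(f)
ns.setdefault("spec", {})["finalizers"] = []  # drop the "kubernetes" finalizer
with open(path, "w") as f:
    json.dump(ns, f, indent=2)
EOF
}
```

Usage: strip_finalizers tmp.json, then continue with the oc proxy and curl steps above.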
Root Cause
Every Kubernetes namespace has a kubernetes finalizer that prevents its final deletion when a delete on that namespace is requested. This is so that the masters can delete the resources in the namespace before deleting the namespace itself.
However, many different reasons can lead to some of these resources not being properly deleted. A typical example is the failure of an external apiservice (like the service catalog).
This solution provides general guidance on how to troubleshoot this kind of situation, as well as procedures to force namespace deletion or the deletion of individual objects that can also be stuck on finalizers. Those procedures must be used only as a last resort.
Diagnostic Steps
To double-check whether a project is stuck "Terminating", you can do the following:
First, attempt running:
$ oc delete project <project name> --force --grace-period=0
If this comes back with output of:
warning: Immediate deletion does not wait for confirmation that the running resource has been terminated. The resource may continue to run on the cluster indefinitely.
Error from server (Conflict): Operation cannot be fulfilled on namespaces "<project>": The system is ensuring all content is removed from this namespace. Upon completion, this namespace will automatically be purged by the system.
Then run:
$ oc get project <project name> -o yaml
If you see the following at the bottom of the output:
spec:
finalizers:
- kubernetes
status:
phase: Terminating
The 'kubernetes' finalizer is keeping the project from being deleted. You need to troubleshoot why, and fix it as described in the Resolution section.
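The checks above can be combined into a one-shot status helper (the function name is illustrative; it assumes oc is logged in):

```shell
# project_stuck_status <project>: print the phase and spec.finalizers of a
# project, to see at a glance whether "kubernetes" is still holding it.
project_stuck_status() {
  p="$1"
  echo "phase: $(oc get project "$p" -o jsonpath='{.status.phase}')"
  echo "finalizers: $(oc get project "$p" -o jsonpath='{.spec.finalizers}')"
}
```

If the output shows phase Terminating together with the kubernetes finalizer, the project matches the situation this article describes.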
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
16 Comments
Getting error:
namespaces "namespace" is forbidden: User "system:anonymous" cannot update namespaces/finalize in the namespace "jenkins": no RBAC policy attached.
But when the oc whoami command was run, it returned
cluster-admin
and then I attempted from the system:admin account.
Me too, that's why this solution is unverified...
Isiaha, Ekky:
That error means that the kube-apiserver is not acknowledging the token or client certificate your kubeconfig is using to prove your identity, so it treats your oc client as an anonymous user.
Running "oc whoami" and getting "cluster-admin" as the answer means that your user's name is "cluster-admin".
If it failed for both the "cluster-admin" and "system:admin" users, then the kubeconfig might be wrong.
this cmd was most useful:
fyi, I had this due to stuck rolebindings.authorization.openshift.io / rolebinding.rbac.authorization.k8s.io.
due to kubernetes.io/iam.security.ibm.com it seems. https://bugzilla.redhat.com/show_bug.cgi?id=1932096 https://www.ibm.com/docs/en/cloud-paks/cp-applications/4.3?topic=troubleshooting-installation-issues#reinstallfails
Same here. This command showed that there was a RoleBinding that was preventing the project from being removed, but deleting it, even forced, didn't work. What did work, as explained, was to set the finalizer to [] followed by a delete --force --grace-period=0 on the RoleBinding; that removed it from the project, and the project disappeared afterwards. There was a bug filed for almost this exact situation, a bug that was closed as not a bug.
https://bugzilla.redhat.com/show_bug.cgi?id=1932096
Thanks for the excellent article!
No solution worked in the RH opentlc lab cee-cf-111.
A solution from RH cannot end with a suggestion that leaves the user to troubleshoot. The user did troubleshoot and it didn't work, hence raising the issue to RH.
The article does not help much. I have the issue on an OKD 4.9 cluster (on vSphere UPI). Looking at the resources of one of the projects stuck in terminating mode gives:
Error from server (InternalError): Internal error occurred: error resolving resource
followed by a long list of packagemanifest.packages.operators.co manifest names from the operator catalog. No indication of the resource causing the error.
Looking at etcd, there are no resources associated with the project. Project status includes:
The only suggestions left in this article are labeled USE WITH CAUTION.
Also the code snippet using etcd has an undefined environment variable ETCDCTL_COMMAND
The knowledge article needs more information on debugging and safely resolving the issue.
Hello,
The ETCDCTL_COMMAND thingy was a typo that has been fixed.
However, you seem to have a problem with an aggregated API server. As this may be a complex problem and OKD is not supported by Red Hat, I'd suggest you try to seek help from the OKD community: https://www.okd.io/help/
If you reproduce this on a supported Red Hat OpenShift Container Platform cluster, please open a support case for that cluster.
RE "OKD is not supported by Red Hat," yes, that is what we tell people.
As indicated in my post above, "oc get " returns an internal error resolving resource and does not name the offending resource (OCP bug?).
A helpful addition to this knowledge base solution might be to replace that oc get with a script that echoes the resource name, then calls oc get. This is what allowed me to track down the problem to a bad certificate field in a CRD.
Echoing the resource name for each invocation is a good idea, so I implemented it in a simpler way: by adding -t to the xargs invocations, so that the invoked command lines are printed to stderr.
Regarding the internal error, that is likely to be a bug that may need to be triaged. So, if found in OKD, feel free to open an issue if you see you have the right information to do so. If found in RHOCP with a proper subscription, please open a support case.
Your solution is better than mine was, thanks,
OKD issue: https://github.com/openshift/okd/issues/1222
OCP issue: https://bugzilla.redhat.com/show_bug.cgi?id=2084960
As you indicate, there is some underlying problem that is likely more serious than the "oc get" issue. Hopefully someone will follow it further.
If you can reproduce this issue with Enterprise RHOCP and you have proper subscriptions, please open a support case. This kind of issues must be first triaged by Red Hat Support before opening a bugzilla to the product, as they may happen due to a problem in your cluster and not due to a bug in the product.
If you don't have an account with Red Hat and/or proper subscriptions for Red Hat OpenShift Container Platform, you would need then to stick to the OKD issue.
No.
I am following the proper procedures, as you will see by looking at the OKD issue and the OCP issue (linked above).
I am just trying to help get rid of one of the causes of projects/namespaces hanging in terminating mode (due in this case to an internal error when checking the project's resources).
However, as I can reproduce the issue on Minikube, it seems to be upstream from OCP.
I see there is some misconception about what this article is and what its goal is.
The goal of this article is to:
The "USE WITH CAUTION" warnings are there for a reason. Just as an example, as early versions of this article did not document these risks properly, I saw somebody forcing the removal of a project without all the services having been removed, so there were services in the API for a non-existing project, which in turn caused the OpenShift SDN to start misbehaving in a horrible manner that caused great cluster-wide impact.
This article is not a direct and easy "I have a problem, I apply some steps, I get a solution" recipe, and it doesn't intend to be that. There may be concrete situations causing stuck objects that can be identified in such a simple manner, and each of them would deserve its own smaller solution. This is a generic reference on how to apply these "for very extreme emergencies" advanced procedures, and some generic guidelines on what to do before trying them.
With ServiceMesh, there are 2 resources that cannot be terminated using oc patch -n <project-name> <object-kind>/<object-name> --type=merge -p '{"metadata": {"finalizers":null}}'
$ oc patch -n istio-system servicemeshcontrolplane.maistra.io/service-mesh-installation --type=merge -p '{"metadata": {"finalizers":null}}'
Error from server (InternalError): Internal error occurred: failed calling webhook "smcp.mutation.maistra.io": Post "https://maistra-admission-controller.openshift-operators.svc:443/mutate-smcp?timeout=10s": no endpoints available for service "maistra-admission-controller"
$ oc get servicemeshcontrolplane.maistra.io/service-mesh-installation
NAME                        READY   STATUS            PROFILES      VERSION   AGE
service-mesh-installation   9/9     ComponentsReady   ["default"]   2.0.8     198d
$ oc delete servicemeshcontrolplane.maistra.io/service-mesh-installation
servicemeshcontrolplane.maistra.io "service-mesh-installation" deleted
The delete command also hangs.