After cluster installation, some required monitoring pods cannot get persistent volumes attached when using a custom AWS KMS key
Environment
- Red Hat OpenShift Service on AWS [ROSA]
- 4.x
Issue
- After installing a Red Hat OpenShift Service on AWS cluster by using a custom AWS KMS key, the monitoring operator is in DEGRADED state.
- The pods alertmanager-main and prometheus-k8s, along with their Persistent Volume Claims in the openshift-monitoring namespace, are in Pending state.
Resolution
Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.
- This issue can be avoided during cluster installation by adding the AWS role <_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred to the AWS KMS key permissions. Refer to the section "Steps required during cluster installation" below.
- The issue can also be fixed after cluster installation by following the procedure in the section "Steps required after cluster installation" below.
Steps required during cluster installation
- If you are planning to install a new cluster by using your custom AWS KMS key (refer to the ROSA documentation), follow this procedure:
- Create the KMS key by following the AWS documentation.
- Run the command to create the cluster in interactive mode:
$ rosa create cluster --interactive --sts
- Provide the KMS key ARN when prompted. For more information about ARNs (Amazon Resource Names), refer to the AWS documentation.
- When the installer prints:
I: Run the following commands to continue the cluster creation:
rosa create operator-roles --cluster <_clustername_>
rosa create oidc-provider --cluster <_clustername_>
run the first command, which generates the operator roles:
$ rosa create operator-roles --cluster <_clustername_>
- Once all the roles are created, get the ARN of the role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' (shown in the output below) and add it to your existing KMS key policy. The list of permissions is provided in the attached file "AWS KMS key".
I: Created role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' with ARN 'arn:aws:iam::<_aws-account-id_>:role/<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred'
- Finally, run the second command to continue the installation:
$ rosa create oidc-provider --cluster <_clustername_>
The installation should proceed with no issues.
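The authoritative list of permissions is the attached file "AWS KMS key". As a rough illustration of what adding the role to the key policy can look like, the sketch below prints two policy statements for a given role ARN. The action list is an assumption modeled on the permissions AWS commonly documents for EBS volume encryption, and the function name, Sid values, cluster name and account ID are hypothetical placeholders. Verify the output against the attachment before applying it.

```shell
#!/bin/sh
# Sketch only: print KMS key policy statements that grant a role use of the key.
# ASSUMPTION: the action list below is illustrative; the attachment
# "AWS KMS key" remains the authoritative list of permissions.
kms_statements_for_role() {
  role_arn="$1"
  cat <<EOF
[
  {
    "Sid": "AllowEbsCsiDriverRoleUseOfKey",
    "Effect": "Allow",
    "Principal": { "AWS": "${role_arn}" },
    "Action": [
      "kms:Encrypt",
      "kms:Decrypt",
      "kms:ReEncrypt*",
      "kms:GenerateDataKey*",
      "kms:DescribeKey"
    ],
    "Resource": "*"
  },
  {
    "Sid": "AllowEbsCsiDriverRoleGrants",
    "Effect": "Allow",
    "Principal": { "AWS": "${role_arn}" },
    "Action": [
      "kms:CreateGrant",
      "kms:ListGrants",
      "kms:RevokeGrant"
    ],
    "Resource": "*",
    "Condition": { "Bool": { "kms:GrantIsForAWSResource": "true" } }
  }
]
EOF
}

# Placeholder values; replace with your own cluster name and AWS account ID.
CLUSTER_NAME="mycluster"
AWS_ACCOUNT_ID="123456789012"
kms_statements_for_role "arn:aws:iam::${AWS_ACCOUNT_ID}:role/${CLUSTER_NAME}-openshift-cluster-csi-drivers-ebs-cloud-cred"
```

The two objects printed are meant to be merged into the "Statement" array of the existing key policy, not to replace the policy as a whole.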
Steps required after cluster installation
- With the cluster already installed, follow this procedure to fix the issue:
- Get the ARN of the role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' in the AWS Console and add it to your existing KMS key policy. The list of permissions is provided in the attached file "AWS KMS key".
- Once the role has been added to the key policy, delete the Persistent Volume Claims from the openshift-monitoring namespace:
$ oc delete pvc \
    alertmanager-data-alertmanager-main-0 \
    alertmanager-data-alertmanager-main-1 \
    prometheus-data-prometheus-k8s-0 \
    prometheus-data-prometheus-k8s-1 -n openshift-monitoring
- With the PVCs deleted, also delete the affected pods from the openshift-monitoring namespace:
$ oc delete pod \
    alertmanager-main-0 \
    alertmanager-main-1 \
    prometheus-k8s-0 \
    prometheus-k8s-1 -n openshift-monitoring
- The pods are expected to be rescheduled with their Persistent Volume Claims bound.
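The key policy change in this procedure is described using the AWS Console. For environments where the AWS CLI is preferred, the sketch below wraps the equivalent calls in small helper functions. The function names and the file name key-policy.json are hypothetical, and the sketch assumes the key uses the policy name "default" and that the caller is allowed to read and update the key policy.

```shell
#!/bin/sh
# Sketch only: AWS CLI helpers for the post-installation key policy update.
# ASSUMPTION: the function names and key-policy.json are illustrative.

# Print the ARN of an IAM role, given its name.
get_role_arn() {
  aws iam get-role --role-name "$1" --query Role.Arn --output text
}

# Print the current key policy document of a KMS key.
export_key_policy() {
  aws kms get-key-policy --key-id "$1" --policy-name default --query Policy --output text
}

# Replace the key policy of a KMS key with the contents of a local file.
apply_key_policy() {
  aws kms put-key-policy --key-id "$1" --policy-name default --policy "file://$2"
}

# Typical sequence (replace the placeholders before running):
#   get_role_arn "<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred"
#   export_key_policy "<kms-key-id>" > key-policy.json
#   ... edit key-policy.json and add the role with the permissions
#       listed in the attachment "AWS KMS key" ...
#   apply_key_policy "<kms-key-id>" key-policy.json
#
# After deleting the PVCs and pods, confirm the recovery with:
#   oc get pvc -n openshift-monitoring
#   oc get co monitoring
```

Because put-key-policy replaces the whole policy document, it is safer to export and edit the existing policy than to write a new one from scratch.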
Root Cause
- When using a custom AWS KMS key, the operator role created during installation, <_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred, requires permission to use the key in order to provision the Persistent Volumes needed by the pods in the openshift-monitoring namespace. This role is missing from the KMS key permissions.
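One way to confirm this root cause is to check whether the key policy mentions the operator role at all. The sketch below is a deliberately simple substring check on a policy document read from standard input. The function name is hypothetical, and a match only shows that the role is referenced, not that it has been granted the right actions.

```shell
#!/bin/sh
# Sketch only: report whether a key policy document references a given role.
# ASSUMPTION: role_in_policy is an illustrative helper, not part of any CLI.
role_in_policy() {
  # $1: role name or ARN to look for; the policy JSON is read from stdin.
  if grep -qF -- "$1"; then
    echo "present"
  else
    echo "missing"
  fi
}

# Usage (replace the placeholders before running):
#   aws kms get-key-policy --key-id "<kms-key-id>" --policy-name default \
#       --query Policy --output text \
#     | role_in_policy "<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred"
```

A result of "missing" is consistent with the symptoms described in this article.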
Diagnostic Steps
- Check the monitoring cluster operator. It is expected to be DEGRADED and PROGRESSING:
$ oc get co monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.11.4    False       True          True       3d19h   Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
- Check the pod status in the openshift-monitoring namespace. The pods alertmanager-main and prometheus-k8s appear as Pending:
$ oc get pods -n openshift-monitoring | grep Pending
alertmanager-main-0   0/6   Pending   0   3d18h
alertmanager-main-1   0/6   Pending   0   3d18h
prometheus-k8s-0      0/6   Pending   0   3d18h
prometheus-k8s-1      0/6   Pending   0   3d18h
- Check the events in the openshift-monitoring namespace and look for the messages below:
$ oc get events --sort-by='{.lastTimestamp}' -n openshift-monitoring | grep alertmanager-main-0
11m Warning FailedScheduling pod/alertmanager-main-0 running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
8m21s Normal Provisioning persistentvolumeclaim/alertmanager-data-alertmanager-main-0 External provisioner is provisioning volume for claim "openshift-monitoring/alertmanager-data-alertmanager-main-0"
2m46s Normal ExternalProvisioning persistentvolumeclaim/alertmanager-data-alertmanager-main-0 waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
- Check the status of the Persistent Volume Claims in the openshift-monitoring namespace. They also appear as Pending:
$ oc get pvc -n openshift-monitoring
NAME                                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS       AGE
alertmanager-data-alertmanager-main-0   Pending                                      gp3-customer-kms   3d18h
alertmanager-data-alertmanager-main-1   Pending                                      gp3-customer-kms   3d18h
prometheus-data-prometheus-k8s-0        Pending                                      gp3-customer-kms   3d18h
prometheus-data-prometheus-k8s-1        Pending                                      gp3-customer-kms   3d18h
- Describe the Persistent Volume Claims in the openshift-monitoring namespace. They are expected to show events like the following:
$ oc describe pvc alertmanager-data-alertmanager-main-0 -n openshift-monitoring
... <content omitted> ...
Events:
Type Reason Age From Message
---- ------ ---- ---- -------
Normal ExternalProvisioning 3m30s (x22263 over 3d18h) persistentvolume-controller waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
Normal Provisioning 22s (x1458 over 3d18h) ebs.csi.aws.com_ip-<ip_address>_8a7fd702-c7eb-4182-b72f-1d2b9c4c8de5 External provisioner is provisioning volume for claim "openshift-monitoring/alertmanager-data-alertmanager-main-0"
- No Persistent Volumes are expected to be available for the related Persistent Volume Claims:
$ oc get pv
No resources found
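The checks above can be narrowed down with a small filter when a namespace holds many claims. The sketch below reads the table printed by oc get pvc on standard input and prints only the names of the claims whose STATUS column is Pending. The function name is hypothetical, and the sketch assumes the default table output with STATUS as the second column.

```shell
#!/bin/sh
# Sketch only: list the PVCs that are still Pending.
# ASSUMPTION: input is the default table output of "oc get pvc",
# with a header line and STATUS as the second column.
pending_pvcs() {
  awk 'NR > 1 && $2 == "Pending" { print $1 }'
}

# Usage against a live cluster:
#   oc get pvc -n openshift-monitoring | pending_pvcs
#
# The storage class shown in the output above can be inspected with:
#   oc get storageclass gp3-customer-kms -o yaml
```

An empty result after applying the resolution indicates that no claim is left in Pending state.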
Attachments