After cluster installation, some required monitoring pods cannot get persistent volumes attached when using a custom AWS KMS key

Solution Verified - Updated -

Environment

  • Red Hat OpenShift Service on AWS [ROSA]
    • 4.x

Issue

  • After installing a Red Hat OpenShift Service on AWS (ROSA) cluster with a custom AWS KMS key, the monitoring cluster operator is in a DEGRADED state.
  • The alertmanager-main and prometheus-k8s pods, along with their Persistent Volume Claims in the openshift-monitoring namespace, are in Pending state.

Resolution

Disclaimer: Links contained herein to external website(s) are provided for convenience only. Red Hat has not reviewed the links and is not responsible for the content or its availability. The inclusion of any link to an external website does not imply endorsement by Red Hat of the website or their entities, products or services. You agree that Red Hat is not responsible or liable for any loss or expenses that may result due to your use of (or reliance on) the external site or content.

  • This issue can be avoided during cluster installation by adding the AWS role <_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred to the AWS KMS key permissions. See "Steps required during cluster installation" below.

  • The issue can also be fixed after cluster installation by following the procedure in "Steps required after cluster installation" below.

Steps required during cluster installation

  • If you plan to install a new cluster with a custom AWS KMS key (see the ROSA documentation), follow this procedure:

    1. Create the KMS key by using the AWS documentation.

    2. Create the cluster in interactive mode:

      $ rosa create cluster --interactive --sts
      
    3. Provide the KMS key ARN when requested. For more information about Amazon Resource Names (ARNs), see the AWS documentation.

    4. When the installer prints the following message:

      I: Run the following commands to continue the cluster creation: 
          rosa create operator-roles --cluster <_clustername_>
          rosa create oidc-provider --cluster <_clustername_>
      

      Run the first command, which generates the operator roles:

      $ rosa create operator-roles --cluster <_clustername_>
      
    5. Once all the roles are created, copy the ARN of the role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' (shown in the output below) and add it to your existing KMS key policy. The required permissions are listed in the attached file "AWS KMS key".

      I: Created role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' with ARN 'arn:aws:iam::<_aws-account-id_>:role/<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred'
      
    6. Finally, run the second command to continue the installation:

      $ rosa create oidc-provider --cluster <_clustername_>
      

    The installation should proceed with no issues.
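
    Step 5 can be partly scripted. The following is a minimal sketch and not part of the original procedure: the helper name extract_csi_role_arn is hypothetical, it assumes the "Created role" output format shown in step 5, and it assumes the output of step 4 was saved to a file. The commented aws kms calls assume the key policy name "default". The permissions to grant must still be taken from the attached file "AWS KMS key".

```shell
#!/bin/sh
# Hypothetical helper: read the output of `rosa create operator-roles` on
# stdin and print the ARN of the EBS CSI driver operator role.
# Assumes the "Created role '<name>' with ARN '<arn>'" line format from step 5.
extract_csi_role_arn() {
  sed -n "s/.*-openshift-cluster-csi-drivers-ebs-cloud-cred' with ARN '\([^']*\)'.*/\1/p"
}

# Example usage (operator-roles.log and KMS_KEY_ARN are placeholders):
#   ROLE_ARN=$(extract_csi_role_arn < operator-roles.log)
#   aws kms get-key-policy --key-id "$KMS_KEY_ARN" --policy-name default \
#     --query Policy --output text > policy.json
#   # Edit policy.json to grant $ROLE_ARN the permissions from the attached file.
#   aws kms put-key-policy --key-id "$KMS_KEY_ARN" --policy-name default \
#     --policy file://policy.json
```

    Keeping the parsing in a small function separates it from the AWS calls, so the policy edit itself remains a deliberate manual step.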

Steps required after cluster installation

  • With the cluster already installed, run the following procedure to fix the issue:

    1. Copy the ARN of the role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' from the AWS Console and add it to your existing KMS key policy. The required permissions are listed in the attached file "AWS KMS key".

    2. Once the role has been added to the key permissions, delete the Persistent Volume Claims from the openshift-monitoring namespace:

      $ oc delete pvc \
      alertmanager-data-alertmanager-main-0 \
      alertmanager-data-alertmanager-main-1 \
      prometheus-data-prometheus-k8s-0 \
      prometheus-data-prometheus-k8s-1 -n openshift-monitoring
      
    3. With the PVCs deleted, delete the affected pods from the openshift-monitoring namespace:

      $ oc delete pod \
      alertmanager-main-0 \
      alertmanager-main-1 \
      prometheus-k8s-0 \
      prometheus-k8s-1 -n openshift-monitoring
      
    4. The pods are expected to be scheduled again with their Persistent Volume Claims bound.
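
    To confirm the result of step 4, the claim status can be checked from the command line. The following is a minimal sketch under stated assumptions: the helper name all_pvcs_bound is hypothetical, and it assumes the default table output of oc get pvc, with STATUS in the second column.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get pvc` table output on stdin and succeed
# only if at least one claim is listed and every claim has STATUS "Bound".
all_pvcs_bound() {
  awk 'BEGIN { total = 0; bound = 0 }
       NR > 1 { total++; if ($2 == "Bound") bound++ }   # NR > 1 skips the header row
       END { exit !(total > 0 && bound == total) }'
}

# Example usage:
#   oc get pvc -n openshift-monitoring | all_pvcs_bound && echo "all PVCs bound"
#   oc get co monitoring
```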

Root Cause

  • When a custom AWS KMS key is used, the operator role created during installation, <_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred, needs permission to use the key in order to provision the Persistent Volumes for the pods in the openshift-monitoring namespace. The issue occurs because this role is missing from the KMS key permissions.
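
  • One way to check for this condition is to look for the role ARN in the current key policy. The following is a minimal sketch: the helper name policy_mentions_role is hypothetical, KMS_KEY_ARN and ROLE_ARN are placeholders, and the check is purely textual, so a match does not prove that the statement grants the required actions.

```shell
#!/bin/sh
# Hypothetical helper: succeed if the key policy JSON read from stdin
# contains the role ARN passed as the first argument (fixed-string match).
policy_mentions_role() {
  grep -F -q -- "$1"
}

# Example usage (assumes the aws CLI is configured for the account owning the key):
#   aws kms get-key-policy --key-id "$KMS_KEY_ARN" --policy-name default \
#     --query Policy --output text | policy_mentions_role "$ROLE_ARN" \
#     || echo "role is missing from the key policy"
```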

Diagnostic Steps

  • Check the monitoring cluster operator. It is expected to be DEGRADED and PROGRESSING:
$ oc get co monitoring
NAME         VERSION   AVAILABLE   PROGRESSING   DEGRADED   SINCE   MESSAGE
monitoring   4.11.4    False       True          True       3d19h   Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
  • Check the pod status in the openshift-monitoring namespace. The alertmanager-main and prometheus-k8s pods appear as Pending:
$ oc get pods -n openshift-monitoring | grep Pending
alertmanager-main-0                                      0/6     Pending     0                3d18h
alertmanager-main-1                                      0/6     Pending     0                3d18h
prometheus-k8s-0                                         0/6     Pending     0                3d18h
prometheus-k8s-1                                         0/6     Pending     0                3d18h
  • Check the events in the openshift-monitoring namespace and look for messages like the following:
$ oc get events --sort-by='{.lastTimestamp}' -n openshift-monitoring | grep alertmanager-main-0
11m         Warning   FailedScheduling       pod/alertmanager-main-0                                       running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
8m21s       Normal    Provisioning           persistentvolumeclaim/alertmanager-data-alertmanager-main-0   External provisioner is provisioning volume for claim "openshift-monitoring/alertmanager-data-alertmanager-main-0"
2m46s       Normal    ExternalProvisioning   persistentvolumeclaim/alertmanager-data-alertmanager-main-0   waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
  • Check the status of the Persistent Volume Claims in the openshift-monitoring namespace. They also appear as Pending:
$ oc get pvc -n openshift-monitoring
NAME                                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS       AGE
alertmanager-data-alertmanager-main-0   Pending                                      gp3-customer-kms   3d18h
alertmanager-data-alertmanager-main-1   Pending                                      gp3-customer-kms   3d18h
prometheus-data-prometheus-k8s-0        Pending                                      gp3-customer-kms   3d18h
prometheus-data-prometheus-k8s-1        Pending                                      gp3-customer-kms   3d18h
  • Describe the Persistent Volume Claims in the openshift-monitoring namespace. They are expected to show events like the following:
$ oc describe pvc alertmanager-data-alertmanager-main-0 -n openshift-monitoring
... <content omitted> ...
Events:
  Type    Reason                Age                        From                                                                 Message
  ----    ------                ----                       ----                                                                 -------
  Normal  ExternalProvisioning  3m30s (x22263 over 3d18h)  persistentvolume-controller                                          waiting for a volume to be created, either by external provisioner "ebs.csi.aws.com" or manually created by system administrator
  Normal  Provisioning          22s (x1458 over 3d18h)     ebs.csi.aws.com_ip-<ip_address>_8a7fd702-c7eb-4182-b72f-1d2b9c4c8de5  External provisioner is provisioning volume for claim "openshift-monitoring/alertmanager-data-alertmanager-main-0"
  • No Persistent Volumes are expected to be available for the related Persistent Volume Claims:
$ oc get pv
No resources found
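
  • The pod and event checks above can be combined. The following is a minimal sketch: the helper name pending_pods is hypothetical, and it assumes the default table output of oc get pods, with STATUS in the third column.

```shell
#!/bin/sh
# Hypothetical helper: read `oc get pods` table output on stdin and print
# the names of the pods whose STATUS column is "Pending".
pending_pods() {
  awk 'NR > 1 && $3 == "Pending" { print $1 }'   # NR > 1 skips the header row
}

# Example usage: show the events recorded for each Pending pod.
#   oc get pods -n openshift-monitoring | pending_pods | while read -r pod; do
#     oc get events -n openshift-monitoring --field-selector "involvedObject.name=$pod"
#   done
```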

Attachments

