After cluster installation, some required monitoring pods cannot get persistent volumes attached when using a custom AWS KMS key

  • Red Hat OpenShift Service on AWS [ROSA]
    • 4.x


  • After installing a Red Hat OpenShift Service on AWS (ROSA) cluster with a custom AWS KMS key, the monitoring cluster operator is in the DEGRADED state.
  • The pods alertmanager-main and prometheus-k8s, along with their Persistent Volume Claims in the openshift-monitoring namespace, are in the Pending state.


  • This issue can be avoided during cluster installation by adding the AWS role <_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred to the AWS KMS key policy. See "Steps required during cluster installation" below.

  • The issue can also be fixed after cluster installation by following the procedure in "Steps required after cluster installation" below.

Steps required during cluster installation

  • If you plan to install a new cluster with your custom AWS KMS key (please refer to the ROSA documentation), follow this procedure:

    1. Create the KMS key by using the AWS documentation.

    2. Run the command to create the cluster in the interactive mode:

      $ rosa create cluster --interactive --sts
    3. Provide the KMS key ARN when prompted. For more information about ARNs (Amazon Resource Names), please refer to the AWS documentation.

    4. When the installer prints the following message:

      I: Run the following commands to continue the cluster creation: 
          rosa create operator-roles --cluster <_clustername_>
          rosa create oidc-provider --cluster <_clustername_>

      Run the first command which generates the operator roles:

      $ rosa create operator-roles --cluster <_clustername_>
    5. Once all the roles are created, copy the ARN of the role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' (shown in the output below) and add it to your existing KMS key policy. The list of required permissions is in the attached file "AWS KMS key".

      I: Created role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' with ARN 'arn:aws:iam::<_aws-account-id_>:role/<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred'
    6. Finally, run the second command to continue the installation:

      $ rosa create oidc-provider --cluster <_clustername_>

    The installation should proceed with no issues.
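
    The key policy change in step 5 can also be made from the command line. The following is a minimal sketch, assuming the AWS CLI is installed and authenticated against the account that owns the key. The function names and the policy file path are illustrative and not part of the original procedure, and the statements to add to the policy still come from the attached file "AWS KMS key".

```shell
#!/usr/bin/env bash
# Sketch only: the function names below are hypothetical helpers.

# Build the name of the EBS CSI operator role from the cluster name.
ebs_csi_role_name() {
  printf '%s-openshift-cluster-csi-drivers-ebs-cloud-cred\n' "$1"
}

# Print the ARN of that role (needs AWS CLI credentials for the account).
ebs_csi_role_arn() {
  aws iam get-role --role-name "$(ebs_csi_role_name "$1")" \
    --query 'Role.Arn' --output text
}

# Save the current key policy to a file so that it can be edited by hand.
# $1 = KMS key ID or ARN, $2 = output file.
export_key_policy() {
  aws kms get-key-policy --key-id "$1" --policy-name default \
    --query Policy --output text > "$2"
}

# Upload the edited policy back to the key.
# $1 = KMS key ID or ARN, $2 = edited policy file.
import_key_policy() {
  aws kms put-key-policy --key-id "$1" --policy-name default \
    --policy "file://$2"
}
```

    A possible sequence is to call ebs_csi_role_arn with the cluster name, run export_key_policy with the key ARN and a file such as policy.json, add the role ARN to the statements listed in the attached file, and then run import_key_policy with the same arguments. Because put-key-policy replaces the whole policy, keep a copy of the exported file before editing it.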

Steps required after cluster installation

  • With the cluster already installed, run the following procedure to fix the issue:

    1. Get the ARN of the role '<_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred' from the AWS Console and add it to your existing KMS key policy. The list of required permissions is in the attached file "AWS KMS key".

    2. Once the role has been added to the key policy, delete the Persistent Volume Claims from the openshift-monitoring namespace:

      $ oc delete pvc \
      alertmanager-data-alertmanager-main-0 \
      alertmanager-data-alertmanager-main-1 \
      prometheus-data-prometheus-k8s-0 \
      prometheus-data-prometheus-k8s-1 -n openshift-monitoring
    3. With the PVCs deleted, also delete the affected pods from the openshift-monitoring namespace:

      $ oc delete pod \
      alertmanager-main-0 \
      alertmanager-main-1 \
      prometheus-k8s-0 \
      prometheus-k8s-1 -n openshift-monitoring
    4. The pods are recreated automatically and are expected to be scheduled once their new Persistent Volume Claims are bound.
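
    To confirm that the fix took effect, the same checks used for diagnosis can be repeated. The following is a minimal sketch, assuming a logged-in oc session. The helper names count_pending and check_monitoring_storage are hypothetical and only summarize the standard oc get table output.

```shell
#!/usr/bin/env bash
# Sketch only: both function names are illustrative.

# Count rows of `oc get` table output (read from stdin) whose STATUS
# column reads "Pending". $1 is the 1-based index of that column.
count_pending() {
  awk -v col="$1" 'NR > 1 && $col == "Pending" { n++ } END { print n + 0 }'
}

# Summarize the state of the monitoring storage after the fix.
check_monitoring_storage() {
  ns=openshift-monitoring
  # STATUS is the 2nd column for PVCs and the 3rd column for pods.
  echo "Pending PVCs: $(oc get pvc -n "$ns" | count_pending 2)"
  echo "Pending pods: $(oc get pods -n "$ns" | count_pending 3)"
  oc get co monitoring
}
```

    Running check_monitoring_storage a few minutes after deleting the pods should report zero Pending PVCs and pods. The monitoring cluster operator may take longer to leave the DEGRADED state.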

Root Cause

  • When using a custom AWS KMS key, the operator role created during installation, <_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred, needs permission to use the key in order to provision the Persistent Volumes required by the pods in the openshift-monitoring namespace. This role is not included in the KMS key policy by default, so volume provisioning fails.

Diagnostic Steps

  • Check the monitoring cluster operator. It is expected to be DEGRADED and PROGRESSING:
$ oc get co monitoring
monitoring   4.11.4    False       True          True       3d19h   Rollout of the monitoring stack failed and is degraded. Please investigate the degraded status error.
  • Check the status of the pods in the openshift-monitoring namespace. The pods alertmanager-main and prometheus-k8s appear as Pending:
$ oc get pods -n openshift-monitoring | grep Pending
alertmanager-main-0                                      0/6     Pending     0                3d18h
alertmanager-main-1                                      0/6     Pending     0                3d18h
prometheus-k8s-0                                         0/6     Pending     0                3d18h
prometheus-k8s-1                                         0/6     Pending     0                3d18h
  • Check the events in the openshift-monitoring namespace and look for the messages below:
$ oc get events --sort-by='{.lastTimestamp}' -n openshift-monitoring | grep alertmanager-main-0
11m         Warning   FailedScheduling       pod/alertmanager-main-0                                       running PreBind plugin "VolumeBinding": binding volumes: timed out waiting for the condition
8m21s       Normal    Provisioning           persistentvolumeclaim/alertmanager-data-alertmanager-main-0   External provisioner is provisioning volume for claim "openshift-monitoring/alertmanager-data-alertmanager-main-0"
2m46s       Normal    ExternalProvisioning   persistentvolumeclaim/alertmanager-data-alertmanager-main-0   waiting for a volume to be created, either by external provisioner "" or manually created by system administrator
  • Check the status of the Persistent Volume Claims in the openshift-monitoring namespace. They also appear as Pending:
$ oc get pvc -n openshift-monitoring
NAME                                    STATUS    VOLUME   CAPACITY   ACCESS MODES   STORAGECLASS       AGE
alertmanager-data-alertmanager-main-0   Pending                                      gp3-customer-kms   3d18h
alertmanager-data-alertmanager-main-1   Pending                                      gp3-customer-kms   3d18h
prometheus-data-prometheus-k8s-0        Pending                                      gp3-customer-kms   3d18h
prometheus-data-prometheus-k8s-1        Pending                                      gp3-customer-kms   3d18h
  • Describe the Persistent Volume Claims in the openshift-monitoring namespace. They are expected to show events like the following:
$ oc describe pvc alertmanager-data-alertmanager-main-0 -n openshift-monitoring
... <content omitted> ...
  Type    Reason                Age                        From                                                                 Message
  ----    ------                ----                       ----                                                                 -------
  Normal  ExternalProvisioning  3m30s (x22263 over 3d18h)  persistentvolume-controller                                          waiting for a volume to be created, either by external provisioner "" or manually created by system administrator
  Normal  Provisioning          22s (x1458 over 3d18h)     <ip_address>_8a7fd702-c7eb-4182-b72f-1d2b9c4c8de5  External provisioner is provisioning volume for claim "openshift-monitoring/alertmanager-data-alertmanager-main-0"
  • No Persistent Volumes are expected to be available for the related Persistent Volume Claims:
$ oc get pv
No resources found
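
  • As an optional extra check, the KMS key configured on the storage class can be compared with the key policy. The following is a minimal sketch, assuming the storage class stores the key in the kmsKeyId parameter (the parameter used by the AWS EBS CSI driver) and that the AWS CLI is available. Both function names are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch only: function names are illustrative.

# Print the KMS key configured on a storage class, for example
# gp3-customer-kms. Assumes the key is stored in parameters.kmsKeyId.
storage_class_kms_key() {
  oc get storageclass "$1" -o jsonpath='{.parameters.kmsKeyId}{"\n"}'
}

# Succeed if the key policy JSON read from stdin mentions the given
# role ARN. This is a plain string match: it shows whether the role
# appears in the policy, not which permissions it has been granted.
policy_mentions_role() {
  grep -qF "$1"
}
```

    For example, the output of aws kms get-key-policy --key-id <key> --policy-name default --query Policy --output text can be piped into policy_mentions_role together with the ARN of the <_clustername_>-openshift-cluster-csi-drivers-ebs-cloud-cred role. A non-zero exit status suggests that the role is missing from the key policy.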


