Regenerating OpenShift Cluster Certificates


This procedure is specific to RHSB-2023-001.

An OpenShift Container Platform cluster has multiple chains of trust that use internal self-signed root CAs. You can force OpenShift Container Platform to regenerate these CAs and the associated certificates.

For ROSA, the managed service SREs will perform the FIPS certificate rotation. The ARO teams do not need to rotate.

If you are regenerating certificates for a cluster running in FIPS mode, you must execute these steps from a FIPS-compliant environment, such as a RHEL server booted in FIPS mode.

The purpose of regenerating cluster certificates is to avoid the use of certificates generated by non-FIPS compliant binaries. To achieve this objective, this procedure must be performed after upgrading the target cluster to a fully FIPS compliant version.
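For example, before starting you can confirm that your local environment is booted in FIPS mode and that the cluster is on the expected remediated version. This is a minimal sketch; fips-mode-setup is part of RHEL's crypto-policies tooling:

# Confirm the local RHEL workstation is booted in FIPS mode
fips-mode-setup --check

# Confirm the cluster is running the expected (remediated) version
oc get clusterversion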

Regenerating selected control plane certificates

Use the following procedure to regenerate selected certificates for the control plane. This does not rotate all certificates.

Prerequisites

  • You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
  • You have read https://access.redhat.com/security/vulnerabilities/RHSB-2023-001 and have applied the FIPS-remediated z-stream release for the OpenShift version running on your cluster.
  • You have installed the remediated 4.11 z-stream version of the OpenShift CLI (oc).
  • If you are using AWS with STS, you must also have:
    • ccoctl tool available
    • AWS CLI tool available
    • The CLUSTER_ID that was used to push keys.json to S3. This was likely used with ccoctl during installation.

Procedure

  1. Note and save the date when this procedure started so that you have a date for revoking trust for old certificates in the cluster.
    date +"%Y-%m-%dT%H:%M:%S%:z"

    The output will appear in this format. Be sure to save this information.
    2023-06-16T10:57:04-04:00
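    Optionally, you can capture the timestamp in a shell variable so that it is easy to reuse later when revoking trust for the old certificates; the variable name here is arbitrary:

    # Save the start time for later use with --created-before
    CERT_ROTATION_START=$(date +"%Y-%m-%dT%H:%M:%S%:z")
    echo "$CERT_ROTATION_START"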

  2. Confirm that your cluster is in a stable state.
    oc adm wait-for-stable-cluster --minimum-stable-period=5s

    The command should print the following output:

    All clusteroperators are stable

    If your cluster is not in a stable state, get the operators stable before continuing. You can rerun this command to check progress, and the sketch below shows ways to inspect individual operators.
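    For example, you can list the operators and then inspect any that are degraded or still progressing; the operator name below is a placeholder:

    # List all cluster operators and their availability
    oc get clusteroperators

    # Inspect the conditions of a specific operator
    oc describe clusteroperator <operator-name>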

  3. Generate new client certificates in namespace/openshift-config-managed; expiry will be about four weeks in the future.
    oc adm ocp-certificates regenerate-leaf -n openshift-config-managed secrets kube-controller-manager-client-cert-key kube-scheduler-client-cert-key

    The command should print the following output:

    secret/kube-controller-manager-client-cert-key regeneration set
    secret/kube-scheduler-client-cert-key regeneration set

  4. Generate new certificates in namespace/openshift-kube-apiserver-operator; expiry will be about four weeks in the future.
    oc adm ocp-certificates regenerate-leaf -n openshift-kube-apiserver-operator secrets node-system-admin-client

    The command should print the following output:
    secret/node-system-admin-client regeneration set

  5. Generate new certificates in namespace/openshift-kube-apiserver; expiry will be about four weeks in the future.
    oc adm ocp-certificates regenerate-leaf -n openshift-kube-apiserver secrets check-endpoints-client-cert-key control-plane-node-admin-client-cert-key external-loadbalancer-serving-certkey internal-loadbalancer-serving-certkey kubelet-client localhost-recovery-serving-certkey localhost-serving-cert-certkey service-network-serving-certkey

    The command should print the following output:

    secret/check-endpoints-client-cert-key regeneration set
    secret/control-plane-node-admin-client-cert-key regeneration set
    secret/external-loadbalancer-serving-certkey regeneration set
    secret/internal-loadbalancer-serving-certkey regeneration set
    secret/kubelet-client regeneration set
    secret/localhost-recovery-serving-certkey regeneration set
    secret/localhost-serving-cert-certkey regeneration set
    secret/service-network-serving-certkey regeneration set

  6. Wait for cluster operators to stabilize after the change. Be patient; this can take 30 minutes.
    oc adm wait-for-stable-cluster

    The command should print the following output:
    All clusteroperators are stable

  7. Generate new root signers to replace the ones that were created with crypto modules that were not FIPS compliant.

    Attention: Once you do this step, you must complete all the steps up to and including step 20 (restarting the
    nodes) as quickly as possible. After about four weeks, the cluster will automatically regenerate new leaf
    certificates using this new signer, and anything not restarted will stop trusting the kube-apiserver.
    oc adm ocp-certificates regenerate-top-level -n openshift-kube-apiserver-operator secrets kube-apiserver-to-kubelet-signer kube-control-plane-signer loadbalancer-serving-signer localhost-serving-signer service-network-serving-signer

    The command should print the following output:
    secret/kube-apiserver-to-kubelet-signer regeneration set
    secret/kube-control-plane-signer regeneration set
    secret/loadbalancer-serving-signer regeneration set
    secret/localhost-serving-signer regeneration set
    secret/service-network-serving-signer regeneration set
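    Optionally, you can inspect one of the regenerated signers to confirm its new validity window; a minimal sketch assuming openssl is available locally:

    # Decode the new load balancer serving signer and print its subject and validity dates
    oc -n openshift-kube-apiserver-operator get secret/loadbalancer-serving-signer \
      -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates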

  8. Trigger the clusteroperator/kube-controller-manager to create a new bound service account signing key.
    oc -n openshift-kube-controller-manager-operator delete secrets/next-service-account-private-key

    The command should produce the following output:
    secret "next-service-account-private-key" deleted

  9. Trigger the clusteroperator/kube-apiserver to create a new bound service account signing key.
    oc -n openshift-kube-apiserver-operator delete secrets/next-bound-service-account-signing-key

    The command should produce the following output:

    secret "next-bound-service-account-signing-key" deleted

  10. Wait for cluster operators to stabilize after the change. Be patient; this can take 30-45 minutes.
    oc adm wait-for-stable-cluster

    The command should print the following output:
    All clusteroperators are stable

  11. If you are running on AWS and using STS, you will need to update the S3 bucket used by STS to trust the new bound service account token signer.
    a. Collect the current public key used to sign bound service account tokens.
    oc get -n openshift-kube-apiserver-operator secret/next-bound-service-account-signing-key -ojsonpath='{ .data.service-account\.pub }' | base64 -d > bound-sa.pub

    This will write a bound-sa.pub file that contains an RSA public key.

    b. Produce the keys.json file that you will need to provide to S3.
    ccoctl aws create-identity-provider --name=some-name --region=other-region --dry-run --public-key-file=bound-sa.pub

    This will write several files in the local directory. The only one you need is 03-keys.json.

    c. Upload the new keys.json to the S3 bucket that serves your OIDC configuration.
    aws s3 cp ./03-keys.json s3://rh-oidc-staging/$CLUSTER_ID/keys.json

    d. If you are using CloudFront to serve your OIDC provider, create an invalidation in CloudFront for the path of
    the keys.json, for example, /$CLUSTER_ID/keys.json.

    How you do this may vary depending on your situation. For example, on the CLI, run the following command:

    aws cloudfront create-invalidation --distribution-id $DISTRIBUTION_ID --paths /$CLUSTER_ID/keys.json
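    Optionally, you can check that the OIDC endpoint now serves the new keys. This is a sketch; it assumes the keys are published at <issuer>/keys.json, which matches the bucket layout used above, and a cached CloudFront response may take a short time to refresh after the invalidation:

    # Compare the published keys.json with the file generated locally
    ISSUER=$(oc get authentication.config cluster -o jsonpath='{.spec.serviceAccountIssuer}')
    curl -s "$ISSUER/keys.json" | diff - ./03-keys.json && echo "keys.json matches"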

  12. If you are running on GCP with Workload Identity, you will need to perform the following updates to trust the new bound service account token signer.
    a. Collect the current public key used to sign bound service account tokens.
    oc get -n openshift-kube-apiserver-operator secret/next-bound-service-account-signing-key -ojsonpath='{ .data.service-account\.pub }' | base64 -d > bound-sa.pub

    b. Produce the keys.json that you will need to provide to the GCS bucket. You must export GOOGLE_CREDENTIALS;
    otherwise the ccoctl command prompts with "Service Account (absolute path to file or JSON content) [Enter 2
    empty lines to finish]", even if gcloud auth list shows that the current account is already
    aos-qe-serviceaccount@openshift-qe.iam.gserviceaccount.com. Adjust the credentials path and project for your
    environment.

    export GOOGLE_CREDENTIALS=/home/xxia/my/priv/cucushift-internal/config/credentials/openshift-qe-gce_v4.json

    REGION=$(oc get infrastructure cluster -o=jsonpath='{.status.platformStatus.gcp.region}')

    BUCKET=$(oc get authentication.config cluster -o jsonpath='{.spec.serviceAccountIssuer}' | grep -o '[^/]*$')

    WIDP=$(sed 's/-oidc$//' <<< $BUCKET)

    ccoctl gcp create-workload-identity-provider --name=$WIDP --region=$REGION --project=openshift-qe --dry-run --public-key-file=bound-sa.pub --workload-identity-pool=$WIDP

    c. Upload the new keys.json to the GCS bucket. If your network cannot reach Google Cloud services directly,
    set a proxy for the gsutil commands; otherwise they can hang retrying (for example, "INFO 0705
    17:39:39.103127 retry_util.py] Retrying request, attempt #21...").

    https_proxy=http://squid.apac.redhat.com:3128 gsutil ls gs://$BUCKET/

    # Unlike 03-keys.json for the AWS STS cluster above, it is 04-keys.json for a GCP Workload Identity cluster.
    KEYS_FILE=$(ls 0?-keys.json)

    https_proxy=http://squid.apac.redhat.com:3128 gsutil cp $KEYS_FILE gs://$BUCKET/keys.json
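    Optionally, you can confirm that the uploaded keys.json matches the locally generated file; this sketch reuses the proxy and variables from the previous commands:

    # Compare the object in the bucket with the local keys file
    https_proxy=http://squid.apac.redhat.com:3128 gsutil cat gs://$BUCKET/keys.json | diff - $KEYS_FILE && echo "keys.json matches"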

  13. Generate new client certificates for openshift-config-managed to replace those that were created with crypto modules that were not FIPS compliant.
    oc adm ocp-certificates regenerate-leaf -n openshift-config-managed secrets kube-controller-manager-client-cert-key kube-scheduler-client-cert-key

    The command should print the following output:

    secret/kube-controller-manager-client-cert-key regeneration set
    secret/kube-scheduler-client-cert-key regeneration set

  14. Update the CA bundle for your cluster on your local kubeconfig. This will rewrite the kubeconfig on your machine to include the same CA bundle that is injected into pods to recognize the kube-apiserver.
    oc config refresh-ca-bundle

    The command will usually produce a message such as,

    CA bundle for cluster "<your-cluster-name>" updated.

    If you get the following message:

    error: failed to update CA bundle: using system CA bundle to verify server, not allowing refresh to
    overwrite

    it means that your connection to the kube-apiserver uses the system trust bundle; the refresh is not required,
    so you can continue.

  15. Create a new kubelet bootstrap.kubeconfig so that the kubelet will recognize the kube-apiserver after the kube-apiserver regenerates its serving certificates.
    oc config new-kubelet-bootstrap-kubeconfig > bootstrap.kubeconfig

    This will create a bootstrap.kubeconfig in your current working directory.

  16. Confirm the new bootstrap.kubeconfig works by running the following command:
    oc whoami --kubeconfig=bootstrap.kubeconfig --server=$(oc get infrastructure/cluster -ojsonpath='{ .status.apiServerURL }')

    The command should print the following output:

    system:serviceaccount:openshift-machine-config-operator:node-bootstrapper

  17. Copy the bootstrap.kubeconfig to every node.
    oc adm copy-to-node nodes --all --copy=bootstrap.kubeconfig=/etc/kubernetes/kubeconfig

  18. Restart all kubelets and remove their old kubelet.kubeconfig so that they pick up the new bootstrap.kubeconfig (these are two different kubeconfig files) and obtain new client certificates.
    oc adm restart-kubelet nodes --all --directive=RemoveKubeletKubeconfig

  19. Restart all nodes to ensure every pod restarts with updated trust bundles that include the new signers.
    oc adm reboot-machine-config-pool mcp/worker mcp/master

    The command should print the following output:

    machineconfig.machineconfiguration.openshift.io/95-oc-initiated-reboot-worker rolling reboot initiated
    machineconfigpool.machineconfiguration.openshift.io/worker rolling reboot initiated
    machineconfig.machineconfiguration.openshift.io/95-oc-initiated-reboot-master rolling reboot initiated
    machineconfigpool.machineconfiguration.openshift.io/master rolling reboot initiated

  20. Wait for the nodes to restart. If this command fails to complete, you will need to determine why particular nodes have not achieved the desired reboot. This usually occurs because nodes are unhealthy in some way. Either get those nodes healthy or remove them. Because trust has rotated, a node that cannot complete this restart will have pods that may not function as desired once operators create new serving certificates after the 30-day countdown from step 7.
    oc adm wait-for-node-reboot nodes --all

    The command prints status lines every 30 seconds until it succeeds. If the status appears frozen for an
    extended period, debug deeper into why particular nodes are not progressing (see the sketch after this step).
    When completed, the command should print the following output:

    All nodes rebooted
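    For example, the following commands can help identify nodes or machine config pools that are not progressing; the node name is a placeholder:

    # Look for nodes that are NotReady or still cordoned
    oc get nodes

    # Check whether the machine config pools report degraded or still-updating machines
    oc get machineconfigpools

    # Inspect a specific node's conditions and recent events
    oc describe node <node-name>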

  21. Generate new client certificates for the kube-apiserver-operator to replace those that were created with crypto modules that were not FIPS compliant.
    oc adm ocp-certificates regenerate-leaf -n openshift-kube-apiserver-operator secrets node-system-admin-client

    The command should print the following output:

    secret/node-system-admin-client regeneration set

  22. Generate new client certificates for the kube-apiserver to replace those that were created with crypto modules that were not FIPS compliant.
    oc adm ocp-certificates regenerate-leaf -n openshift-kube-apiserver secrets check-endpoints-client-cert-key control-plane-node-admin-client-cert-key external-loadbalancer-serving-certkey internal-loadbalancer-serving-certkey kubelet-client localhost-recovery-serving-certkey localhost-serving-cert-certkey service-network-serving-certkey

    The command should print the following output:

    secret/check-endpoints-client-cert-key regeneration set
    secret/control-plane-node-admin-client-cert-key regeneration set
    secret/external-loadbalancer-serving-certkey regeneration set
    secret/internal-loadbalancer-serving-certkey regeneration set
    secret/kubelet-client regeneration set
    secret/localhost-recovery-serving-certkey regeneration set
    secret/localhost-serving-cert-certkey regeneration set
    secret/service-network-serving-certkey regeneration set

  23. Wait for cluster operators to stabilize after the change. This process can take 30 minutes.
    oc adm wait-for-stable-cluster

    The command should print the following output:

    All clusteroperators are stable

    If the image-registry is stuck on an STS cluster, it is a symptom of a problem outside the image-registry
    itself. If it fails, double-check that the keys.json from step 11 is correct and restart all pods. This is
    best done by rebooting all nodes again:

    oc adm reboot-machine-config-pool mcp/worker mcp/master
    oc adm wait-for-node-reboot nodes --all
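    For example, you can check the image-registry operator and its pods before initiating another reboot; a minimal sketch:

    # Check the operator status and the registry pods
    oc get clusteroperator image-registry
    oc -n openshift-image-registry get pods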

  24. At this point, the cluster is using new certificates, but still trusts old certificates.

  25. Create a new system:masters/admin.kubeconfig by running the following command:
    oc config new-admin-kubeconfig > admin.kubeconfig

    This will save a file in admin.kubeconfig that can contact the kube-apiserver using a new signer and client
    cert/key pair.

  26. Confirm that your new admin.kubeconfig works by running the following command:
    oc --kubeconfig=admin.kubeconfig whoami

    The command should print the following output:

    system:admin

  27. Revoke trust for the old signers that were replaced above. An example invocation using a saved start timestamp follows this step.
    oc adm ocp-certificates remove-old-trust -n openshift-kube-apiserver-operator configmaps kube-apiserver-to-kubelet-client-ca kube-control-plane-signer-ca loadbalancer-serving-ca localhost-serving-ca service-network-serving-ca --created-before=<date-from-step-1>

    The command should print the following output:

    configmap/kube-apiserver-to-kubelet-client-ca trust purged
    configmap/kube-control-plane-signer-ca trust purged
    configmap/loadbalancer-serving-ca trust purged
    configmap/localhost-serving-ca trust purged
    configmap/service-network-serving-ca trust purged
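    For example, if you captured the start timestamp in a shell variable at step 1 (as in the earlier sketch), the invocation might look like this; the variable name is an assumption from that sketch:

    oc adm ocp-certificates remove-old-trust -n openshift-kube-apiserver-operator configmaps kube-apiserver-to-kubelet-client-ca kube-control-plane-signer-ca loadbalancer-serving-ca localhost-serving-ca service-network-serving-ca --created-before="$CERT_ROTATION_START"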

  28. Wait for the cluster operators to stabilize after the change. This process can take 30 minutes.
    oc adm wait-for-stable-cluster

    The command should print the following output:

    All clusteroperators are stable

  29. Restart all nodes to ensure every pod restarts with updated trust bundles that include the new signers.
    oc adm reboot-machine-config-pool mcp/worker mcp/master

    The command should print the following output:

    machineconfig.machineconfiguration.openshift.io/95-oc-initiated-reboot-worker rolling reboot initiated
    machineconfigpool.machineconfiguration.openshift.io/worker rolling reboot initiated
    machineconfig.machineconfiguration.openshift.io/95-oc-initiated-reboot-master rolling reboot initiated
    machineconfigpool.machineconfiguration.openshift.io/master rolling reboot initiated

  30. Wait for the nodes to restart. If this command fails to complete, you will need to determine why particular nodes have not achieved the desired reboot. This usually occurs due to nodes being unhealthy in some way.
    oc adm wait-for-node-reboot nodes --all

    The command prints status lines every 30 seconds until it succeeds. If the status appears frozen for an
    extended period of time, debug deeper into why particular nodes are not progressing. When completed, the
    command should print the following output:

    All nodes rebooted

Regenerating CA certificates for the Machine Config Server

Important: Before performing this step, ensure that you are using the latest oc client binary for the respective OCP minor version of the cluster being updated. These can be obtained from the following locations:
Latest 4.10 oc
Latest 4.11 oc
Latest 4.12 oc
Latest 4.13 oc
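You can confirm which client build you are currently using with the following command; the exact output format varies by release:

oc version --client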

The Machine Config Operator has a CA and cert pair used by the Machine Config Server that is not automatically rotated. This is only used when a node boots up and joins the cluster for the first time, and as a result:

  1. It may not fully live within the cluster.

  2. It is otherwise unused after installation time if no new nodes join the cluster.

The certificate chain is used by a new node's ignition binary when it tries to verify the ignition contents served by the cluster's Machine Config Server as follows (using IPI on AWS as an example):

  1. A machineset is scaled up, adding a new node to the cluster.

  2. The user-data secret referenced in the machine set is passed to AWS's cloud user-data, which is a stub ignition containing the MCS CA and an HTTPS URL pointing to the Machine Config Server's endpoint.

  3. The new node runs ignition in the initramfs phase on first boot to request the full ignition config, during which it uses the CA data in the stub ignition to verify the incoming contents.

  4. The new node reboots into the full (worker) configuration and joins the cluster.

The rotation command has two parts:

  1. Rotate the MCS CA/cert in-cluster.

  2. By default, it also updates the stub ignition in the user-data secrets to use the new certificates.

The instructions will be different depending on how this secret is consumed in your particular cluster environment, as detailed below.

Prerequisites

  • You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.
  • You have installed the OpenShift CLI (oc).

Procedure

Before continuing, create a backup of the secrets in the openshift-machine-config-operator project:

oc get secret/machine-config-server-tls -n openshift-machine-config-operator -oyaml > machine-config-server-tls.bak
oc get secret/machine-config-server-ca -n openshift-machine-config-operator -oyaml > machine-config-server-ca.bak
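Optionally, you can inspect the current MCS serving certificate before rotating it, for example to record its validity window; a minimal sketch assuming openssl is available locally:

# Decode the current MCS serving certificate and print its subject and validity dates
oc -n openshift-machine-config-operator get secret/machine-config-server-tls -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates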

Scenario 1: Fully automated (applicable to machine-set-backed node scaling, for example, cloud)

  1. Run one command to regenerate the CA/cert and automatically rotate the user-data.
    oc adm ocp-certificates regenerate-machine-config-server-serving-cert

    The command should produce the following output:

    Successfully rotated MCS CA + certs. Redeploying MCS and updating references.
    Successfully modified user-data secret master-user-data
    Successfully modified user-data secret worker-user-data

    To run the steps separately, run the following two commands:

    oc adm ocp-certificates regenerate-machine-config-server-serving-cert
    oc adm ocp-certificates update-ignition-ca-bundle-for-machine-config-server

  2. Optionally, verify that the CA and certificates have been successfully regenerated.
    oc -n openshift-machine-config-operator get secrets

    You should see the newly generated machine-config-server-ca and machine-config-server-tls secrets in the list.
  3. Optionally, verify that you can scale a new worker node into the cluster, which will incur extra resource costs.

Scenario 2: Semi-automated, manual user-data updating (applicable to non-machine-set-backed scaling, for example, metal/PXE)

  1. Run the regenerate command with the --update-ignition=false flag so that the user-data secrets are not updated. The flag is optional, and the command should not error even if the user-data secrets are unused or do not exist.
    oc adm ocp-certificates regenerate-machine-config-server-serving-cert --update-ignition=false

    The command should produce the following output:

    Successfully rotated MCS CA + certs. Redeploying MCS and updating references.

  2. Find the updated MCS CA cert:
    oc -n openshift-machine-config-operator get secret/machine-config-server-ca -o=jsonpath='{.data.tls\.crt}'

  3. Find your current stub ignition, likely generated by the installer, which is now used to boot nodes. It will look like this:
    {"ignition":{"config":{"merge":[{"source":"..."}]},"security":{"tls":{"certificateAuthorities":[{"source":"data:text/plain;charset=utf-8;base64,CERT"}]}},"version":"3.2.0"}}

    Replace the CERT field with the output from step 2 (see the sketch after this step for one way to script this).

    Optionally, you can try adding a new node with the updated configuration to see if it can join the cluster. The
    new cert is otherwise unused if no new nodes need to join the cluster.
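    If you prefer to script the replacement rather than edit the stub ignition by hand, the following sketch shows one way to do it with jq. It assumes the stub ignition is saved in a file named stub.ign (a hypothetical name) and writes the updated config to stub-updated.ign:

    # Fetch the new MCS CA (already base64-encoded in the secret)
    NEW_CA=$(oc -n openshift-machine-config-operator get secret/machine-config-server-ca -o=jsonpath='{.data.tls\.crt}')

    # Replace the certificate authority source in the stub ignition
    jq --arg ca "data:text/plain;charset=utf-8;base64,${NEW_CA}" \
      '.ignition.security.tls.certificateAuthorities[0].source = $ca' stub.ign > stub-updated.ign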

Regenerating the internal CA for Ingress

After cluster installation, the Ingress Operator generates an internal self-signed CA and stores it in the router-ca secret in the openshift-ingress-operator namespace. The Ingress Operator uses this CA to issue a wildcard certificate for the default Ingress Controller, as well as for custom Ingress Controllers.

The cluster admin must replace the operator-generated wildcard certificate with a custom wildcard certificate after installation, before putting the cluster into production. See Replacing the default ingress certificate for more information.

After the cluster admin has configured a custom wildcard certificate, the self-signed CA certificate in the router-ca secret still exists, even if it is not in use. If this certificate expires or otherwise needs to be regenerated, use the following procedure to regenerate it.

Prerequisites

  • You have access to an OpenShift Container Platform cluster using an account with cluster-admin permissions.

Procedure

  1. Delete the router-ca secret.
    oc -n openshift-ingress-operator delete secrets/router-ca

    The command should produce the following output:

    secret "router-ca" deleted

  2. Verify that the router-ca secret is gone.
    oc -n openshift-ingress-operator get secrets/router-ca

    The command should produce the following output:

    Error from server (NotFound): secrets "router-ca" not found

  3. Restart the Ingress Operator.
    oc -n openshift-ingress-operator delete pods -l name=ingress-operator

    The command should produce output similar to the following (the part of the pod name after ingress-operator- will vary):
    pod "ingress-operator-db59c9d96-t58ch" deleted

    OpenShift Container Platform will then automatically restart the Ingress Operator.

  4. Verify that the Ingress Operator restarts.
    oc -n openshift-ingress-operator get pods -l name=ingress-operator

    The command should produce output similar to the following (the pod name and age will vary):

    NAME                               READY   STATUS    RESTARTS   AGE
    ingress-operator-c549d87f6-ddmgd   2/2     Running   0          10s

    When the Ingress Operator observes that the router-ca secret is missing, it generates a new self-signed CA and
    recreates the secret.

  5. Verify that the Ingress Operator recreates the router-ca secret:
    oc -n openshift-ingress-operator get secrets/router-ca

    The command should produce the following output (the age may vary):

    NAME        TYPE                DATA   AGE
    router-ca   kubernetes.io/tls   2      1s

    The age should reflect that the secret has been recreated since you started the procedure.
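    Optionally, you can inspect the regenerated CA certificate to confirm its new validity window; a minimal sketch assuming openssl is available locally:

    # Decode the new router-ca certificate and print its subject and validity dates
    oc -n openshift-ingress-operator get secret/router-ca -o jsonpath='{.data.tls\.crt}' | base64 -d | openssl x509 -noout -subject -dates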
