Investigating OpenShift CSR Issues

Solution Verified

Environment

  • Red Hat OpenShift 3

Issue

  • During an OCP 3.10+ installation, upgrade, or scale-up, a certificate approval failure occurs
  • The error "Could not find csr for nodes" is reported when installing OpenShift 3.11

Resolution

  1. Ensure that all pending CSRs are approved

    oc get csr -o name | xargs oc adm certificate approve
    
  2. Ensure that the atomic-openshift-node service is running on all relevant nodes

    systemctl status atomic-openshift-node
    
  3. Ensure that the API server can proxy a request to the node's kubelet

    oc get --raw /api/v1/nodes/${NAME}/proxy/healthz
    # Alternative: check every node in the cluster
    for i in $(oc get nodes --no-headers -o=custom-columns=NAME:.metadata.name); do printf '%s\n' "${i}"; oc get --raw "/api/v1/nodes/${i}/proxy/healthz"; printf '\n'; done
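
The approval command in step 1 passes every CSR to oc adm certificate approve, including those that are already approved. As a minimal sketch, the helper below selects only the pending ones. The function name pending_csr_names is hypothetical, and the sketch assumes that CONDITION is the last column of the oc get csr table output.

```shell
#!/bin/sh
# Hypothetical helper, not part of oc: print the names of CSRs whose
# CONDITION column reads "Pending".
# Reads the tabular output of `oc get csr` on stdin.
# Assumption: NAME is the first column and CONDITION is the last one.
pending_csr_names() {
    awk '$NF == "Pending" { print $1 }'
}
```

A possible invocation is oc get csr --no-headers | pending_csr_names | xargs -r oc adm certificate approve. The -r option is a GNU xargs extension that skips the command when no names are passed in.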
    

Diagnostic Steps

  1. Review the journal of atomic-openshift-node for errors

  2. Determine the status of the node's client and server certificates

    # ls -la /etc/origin/node/certificates/
    -rw-------. 1 root root 1167 Nov  5 14:40 kubelet-client-2018-11-05-14-40-27.pem
    lrwxrwxrwx. 1 root root   68 Nov  5 14:40 kubelet-client-current.pem -> /etc/origin/node/certificates/kubelet-client-2018-11-05-14-40-27.pem
    -rw-------. 1 root root 1366 Nov  5 14:40 kubelet-server-2018-11-05-14-40-31.pem
    lrwxrwxrwx. 1 root root   68 Nov  5 14:40 kubelet-server-current.pem -> /etc/origin/node/certificates/kubelet-server-2018-11-05-14-40-31.pem
    
  3. If either the kubelet-client-current.pem or the kubelet-server-current.pem symlink is missing, check for pending CSRs. Review them if necessary, then approve them.

    oc get csr
    oc adm certificate approve csr-ABCDEF
    
  4. If both the kubelet-client-current.pem and kubelet-server-current.pem symlinks are present, the check that proxies a request to the node's kubelet has likely failed due to external factors. The following command should indicate why it failed.

    oc get --loglevel=9 --raw /api/v1/nodes/${NAME}/proxy/healthz
    
  5. Review the apiserver logs for indications of failure

    /usr/local/bin/master-logs api api
    
  6. Increase the logging verbosity of Ansible by adding '-vvv', and use 'tee' to save the output to a log file while displaying it on the console.

    ansible-playbook -i INVENTORY PLAYBOOK_PATH -vvv | tee ~/ansible.log
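
The decision between steps 3 and 4 depends on whether both certificate symlinks exist on the node. The following is a rough sketch of that check, not a Red Hat supplied tool. The function name check_kubelet_certs is hypothetical, and the default directory is taken from the listing in step 2.

```shell
#!/bin/sh
# Hypothetical helper: report whether the kubelet client and server
# certificate symlinks exist and resolve to a real file.
# $1: certificate directory (defaults to the path shown in step 2).
# Returns 0 if both are present, 1 if at least one is missing.
check_kubelet_certs() {
    cert_dir=${1:-/etc/origin/node/certificates}
    missing=0
    for kind in client server; do
        link="$cert_dir/kubelet-$kind-current.pem"
        # -e follows symlinks, so a dangling link is reported as missing.
        if [ -e "$link" ]; then
            echo "OK: $link"
        else
            echo "MISSING: $link"
            missing=1
        fi
    done
    return "$missing"
}
```

Run check_kubelet_certs as root on the affected node. A MISSING line points to step 3 (pending CSRs), while two OK lines point to step 4 (the proxied health check).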
    


1 Comment

I was facing an issue where atomic-openshift-node on an infra node was getting stuck at "csr for this node is still valid". Approving the CSR solved the problem. It was hard to debug, though, because the infra node names do not appear in the output of "oc get csr". Instead, it's shown as "system:serviceaccount:openshift-infra:node-bootstrapper".