Compliance Operator scans fail with ocp4-cis-node-worker-* pods in 1/2 NotReady state

Environment

  • Red Hat OpenShift Container Platform
    • 4.10
  • Compliance Operator
    • 0.1.59

Issue

  • ocp4-cis-node-worker-* pods are in 1/2 NotReady state. These ocp4-cis-node-worker-* pods are supposed to be running on all nodes.
  • Compliance scans have been running for more than 24 hours without completing successfully.
  • Deleting the result-server and result-client secrets does not fix the issue, because these secrets are recreated with the old, expired certificates.

Resolution

The root-ca-ocp4-cis-node-worker secret contains an old, expired certificate. Delete this secret so that it is recreated with a new, valid certificate:

$ oc delete secret root-ca-ocp4-cis-node-worker -n openshift-compliance

After the secret is deleted, all ocp4-cis-node-worker-* pods reach the Running state and the compliance scans complete successfully.
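
To confirm that the root CA certificate is the expired one before deleting the secret, and that the recreated secret carries a fresh certificate, the expiry date can be read directly from the secret. The following is a minimal sketch, not a verified procedure. It assumes that the secret stores its PEM certificate under the tls.crt data key (the key name is an assumption and is not stated in this article) and that openssl and base64 are available on the workstation. The helper name secret_cert_enddate is hypothetical.

```shell
#!/usr/bin/env bash
# Print the notAfter date of the certificate stored in a secret.
# Assumption: the certificate is stored under the "tls.crt" data key.
secret_cert_enddate() {
  local namespace="$1" secret="$2"
  oc get secret "$secret" -n "$namespace" -o jsonpath='{.data.tls\.crt}' \
    | base64 -d \
    | openssl x509 -noout -enddate
}

# Example (run against a live cluster):
#   secret_cert_enddate openshift-compliance root-ca-ocp4-cis-node-worker
```

If the printed notAfter date is in the past, the secret still holds the expired certificate.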

Root Cause

The log-collector container in each failing ocp4-cis-node-worker-* pod fails to upload its results to the result server because of an expired certificate. The certificates for these pods are generated when the scan is initiated and expire within 24 hours. Because the compliance scan had been running for more than 24 hours, the certificates had expired.
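
Since the certificates are generated when the scan starts and expire within 24 hours, a quick sanity check is to compare the age of the root CA secret with that lifetime. The following is a minimal sketch, assuming GNU date and assuming that the certificate was generated at roughly the same time as the secret. The 24-hour figure is taken from this article, and the helper name older_than_24h is hypothetical.

```shell
#!/usr/bin/env bash
# Succeed (exit 0) if the given RFC 3339 timestamp is more than 24 hours
# before the reference time (second argument, defaults to now).
# Requires GNU date for the -d option.
older_than_24h() {
  local created="$1" now="${2:-$(date -u +%Y-%m-%dT%H:%M:%SZ)}"
  local created_s now_s
  created_s="$(date -u -d "$created" +%s)" || return 2
  now_s="$(date -u -d "$now" +%s)" || return 2
  [ $(( now_s - created_s )) -gt 86400 ]
}

# Example (run against a live cluster):
#   created="$(oc get secret root-ca-ocp4-cis-node-worker -n openshift-compliance \
#     -o jsonpath='{.metadata.creationTimestamp}')"
#   older_than_24h "$created" && echo "root CA secret is older than 24 hours"
```

A secret older than 24 hours is only a hint. The certificate's own notAfter date remains the authoritative value.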

Diagnostic Steps

  • Check whether all ocp4-cis-node-worker-* pods are in the 1/2 NotReady state:

    $ oc get pods -n openshift-compliance | grep -E 'ocp4-cis-node-worker-|^NAME'
    NAME                                                     READY   STATUS      RESTARTS        AGE
    ocp4-cis-node-worker-infra01-xxxxxxx-xxxxx               1/2     NotReady    1 (3m24s ago)   9m1s
    ocp4-cis-node-worker-master01-xxxxxxx-xxxx               1/2     NotReady    1 (4m31s ago)   9m2s
    ocp4-cis-node-worker-worker01-xxxxxxx-xxxx               1/2     NotReady    1 (3m3s ago)    9m2s
    ocp4-cis-node-worker-worker02-xxxxxxx-xxxx               1/2     NotReady    1 (3m3s ago)    9m2s
    ocp4-cis-node-worker-worker03-xxxxxxx-xxxx               1/2     NotReady    1 (3m3s ago)    9m2s
    
  • Check whether the log-collector container of the failing ocp4-cis-node-worker-* pods logs the following message:

    $ oc logs ocp4-cis-node-worker-worker01-xxxxxxx-xxxx -c log-collector -n openshift-compliance | grep 'certificate has expired'
    "msg":"Failed to upload results to server","error":"Post \"https://ocp4-cis-node-worker-rs:8443/\": x509: certificate has expired or is not yet valid: current time 2023-01-02T16:23:58Z is after 2022-12-07T01:00:29Z"
    
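To see how long the certificate has been expired, the two timestamps in the x509 error can be compared. The following is a minimal sketch, assuming GNU date, grep, and sed, and assuming that the log message keeps the "current time ... is after ..." shape shown in the sample output. The helper name days_since_expiry is hypothetical.

```shell
#!/usr/bin/env bash
# Read log lines on stdin and, for the first x509 expiry error, print the
# number of whole days between the certificate's expiry and the log time.
days_since_expiry() {
  local line now expiry
  line="$(grep -m1 'certificate has expired')" || return 1
  # Timestamp after "current time" is when the upload was attempted.
  now="$(printf '%s\n' "$line" | sed -n 's/.*current time \([^ ]*\) is after.*/\1/p')"
  # Timestamp after "is after" is the certificate's notAfter value.
  expiry="$(printf '%s\n' "$line" | sed -n 's/.*is after \([0-9TZ:-]*\).*/\1/p')"
  [ -n "$now" ] && [ -n "$expiry" ] || return 1
  echo $(( ( $(date -u -d "$now" +%s) - $(date -u -d "$expiry" +%s) ) / 86400 ))
}

# Example (run against a live cluster):
#   oc logs ocp4-cis-node-worker-worker01-xxxxxxx-xxxx -c log-collector \
#     -n openshift-compliance | days_since_expiry
```

For the sample message in this article (current time 2023-01-02T16:23:58Z, expiry 2022-12-07T01:00:29Z), the helper prints 26.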
