AlertManager is firing an alert named: "KubeJobCompletion"

Solution Verified

Environment

  • Red Hat OpenShift Container Platform
    • 4.5

Issue

  • What is the meaning of the alert KubeJobCompletion?
  • Is this alert critical?

Resolution

  • The KubeJobCompletion alert means that a Job in the cluster did not complete successfully.
  • Although the alert's severity is only Warning, it is good practice to run the Diagnostic Steps below and look more closely at the job pods to ensure cluster health is intact.
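  • The exact rule behind the alert can be confirmed on the cluster itself. The lookup below is a sketch that searches the platform monitoring PrometheusRule objects rather than naming a specific one, since the object holding the rule can differ between versions. On 4.5 the expression typically compares the kube_job_spec_completions and kube_job_status_succeeded metrics from kube-state-metrics, and the severity is warning.
$ oc -n openshift-monitoring get prometheusrules -o yaml | grep -B2 -A8 'alert: KubeJobCompletion'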

Diagnostic Steps

  • Check the state of all job pods. They are expected to be in the Completed or Succeeded state.
$ oc get pods -A | grep -v Completed

openshift-kube-apiserver                          revision-pruner-42-hostname-q9zcj-master-0             0/1    Succeeded  0         11d
openshift-kube-apiserver                          revision-pruner-43-hostname-q9zcj-master-0             0/1    Succeeded  0         11d
openshift-kube-apiserver                          revision-pruner-43-hostname-q9zcj-master-1             0/1    Succeeded  0         11d
openshift-kube-apiserver                          revision-pruner-43-hostname-q9zcj-master-2             0/1    Succeeded  0         11d
openshift-etcd                                    revision-pruner-6-hostname-q9zcj-master-0              0/1    Succeeded  0         11d
openshift-etcd                                    revision-pruner-6-hostname-q9zcj-master-1              0/1    Succeeded  0         11d
openshift-kube-scheduler                          revision-pruner-26-hostname-q9zcj-master-0             0/1    Succeeded  0         11d
openshift-kube-scheduler                          revision-pruner-26-hostname-master-q9zcj-1             0/1    Succeeded  0         11d
openshift-kube-scheduler                          revision-pruner-26-hostname-q9zcj-master-2             0/1    Succeeded  0         11d
openshift-kube-apiserver                          revision-pruner-44-hostname-q9zcj-master-0             0/1    Succeeded  0         2d
openshift-kube-apiserver                          revision-pruner-44-hostname-q9zcj-master-1             0/1    Succeeded  0         2d
openshift-kube-apiserver                          revision-pruner-44-hostname-q9zcj-master-2             0/1    Succeeded  0         2d
openshift-kube-apiserver                          installer-44-hostname-q9zcj-master-2                   0/1    Succeeded  0         2d
openshift-image-registry                          image-pruner-1607040000-pqrcr                          0/1    Succeeded  0         2d
openshift-image-registry                          image-pruner-1607126400-s7w8c                          0/1    Succeeded  0         1d
openshift-kube-apiserver                          revision-pruner-45-hostname-q9zcj-master-0             0/1    Succeeded  0         5h53m
openshift-kube-apiserver                          installer-45-hostname-q9zcj-master-1                   0/1    Succeeded  0         6h6m
openshift-kube-apiserver                          installer-45-hostname-q9zcj-master-2                   0/1    Succeeded  0         6h2m
openshift-kube-apiserver                          revision-pruner-45-hostname-q9zcj-master-1             0/1    Succeeded  0         6h2m
openshift-kube-apiserver                          revision-pruner-45-hostname-q9zcj-master-2             0/1    Succeeded  0         5h58m
openshift-kube-apiserver                          installer-45-hostname-q9zcj-master-0                   0/1    Succeeded  0         5h57m
openshift-image-registry                          image-pruner-1607212800-cgcnr                          0/1    Succeeded  0         3h44m
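  • The same check can also be done at the Job level. The sketch below lists every Job in the cluster; a Job whose COMPLETIONS column shows fewer completions than desired is the one keeping the alert firing.
$ oc get jobs -A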
  • If job pods are not in the Completed or Succeeded state, troubleshoot them further by checking their status.conditions with the commands below. In this case, an image-pruner job pod is used as the example.
$ oc get pod image-pruner-1607212800-cgcnr -n openshift-image-registry -o yaml
-------------------------------------8<---------------------------------------
status:
  conditions:
  - lastTransitionTime: '2020-11-24T14:31:00Z'
    message: 'NodeControllerDegraded: All master nodes are ready'
    reason: AsExpected
    status: 'False'
    type: Degraded
  - lastTransitionTime: '2020-12-05T21:51:05Z'
    message: 'NodeInstallerProgressing: 3 nodes are at revision 45'
    reason: AsExpected
    status: 'True'
    type: Available
  - lastTransitionTime: '2020-08-20T14:46:03Z'
    message: 'StaticPodsAvailable: 3 nodes are active; 3 nodes are at revision 45'
    reason: AsExpected
    status: 'True'
    type: Available
  - lastTransitionTime: '2020-08-20T14:43:27Z'
    reason: AsExpected
    status: 'True'
    type: Upgradeable
------------------------------------->8---------------------------------------
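  • Instead of reading the full YAML, the fields of interest can be extracted directly with jsonpath. A minimal sketch using the same pod:
$ oc -n openshift-image-registry get pod image-pruner-1607212800-cgcnr -o jsonpath='{.status.phase}{"\n"}'
$ oc -n openshift-image-registry get pod image-pruner-1607212800-cgcnr -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'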
  • From the YAML output above it can be seen that the image-pruner job pod completed successfully and its current conditions show that it is functional and available in the cluster.
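  • If a job pod had instead ended in an Error or other failed state, its events and container logs usually show the reason. A generic sketch, with the namespace, pod, and Job names as placeholders:
$ oc -n <namespace> describe pod <job-pod-name>
$ oc -n <namespace> logs <job-pod-name>
$ oc -n <namespace> get job <job-name> -o yaml
  • Once the underlying Job completes successfully (or a failed Job that is no longer needed is removed), the metrics behind the alert stop reporting an incomplete Job and the alert should clear.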

