Chapter 19. Finding information on Kafka restarts

After the Cluster Operator restarts a Kafka pod in an OpenShift cluster, it emits an OpenShift event into the pod’s namespace explaining why the pod restarted. For help in understanding cluster behavior, you can check restart events from the command line.

Tip

You can export and monitor restart events using metrics collection tools like Prometheus. Use the metrics tool with an event exporter that can export the output in a suitable format.

19.1. Reasons for a restart event

The Cluster Operator initiates a restart event for a specific reason. You can check the reason by fetching information on the restart event.

The reason given depends on whether you are using StrimziPodSet or StatefulSet resources for the creation and management of pods.

Table 19.1. Restart reasons

StrimziPodSet

StatefulSet

Description

CaCertHasOldGeneration

CaCertHasOldGeneration

The pod is still using a server certificate signed with an old CA, so needs to be restarted as part of the certificate update.

CaCertRemoved

CaCertRemoved

Expired CA certificates have been removed, and the pod is restarted to run with the current certificates.

CaCertRenewed

CaCertRenewed

CA certificates have been renewed, and the pod is restarted to run with the updated certificates.

ClientCaCertKeyReplaced

ClientCaCertKeyReplaced

The key used to sign the clients CA certificates has been replaced, and the pod is being restarted as part of the CA renewal process.

ClusterCaCertKeyReplaced

ClusterCaCertKeyReplaced

The key used to sign the cluster’s CA certificates has been replaced, and the pod is being restarted as part of the CA renewal process.

ConfigChangeRequiresRestart

ConfigChangeRequiresRestart

A Kafka configuration property that cannot be changed dynamically has been updated, so the broker must be restarted to apply it.

CustomListenerCaCertChanged

CustomListenerCaCertChanged

The CA certificate used to secure the Kafka network listeners has changed, and the pod is restarted to use it.

FileSystemResizeNeeded

FileSystemResizeNeeded

The file system size has been increased, and a restart is needed to apply it.

KafkaCertificatesChanged

KafkaCertificatesChanged

One or more TLS certificates used by the Kafka broker have been updated, and a restart is needed to use them.

ManualRollingUpdate

ManualRollingUpdate

A user annotated the pod, or the StatefulSet or StrimziPodSet it belongs to, to trigger a restart.

PodForceRestartOnError

PodForceRestartOnError

An error occurred that requires a pod restart to rectify.

PodHasOldRevision

JbodVolumesChanged

A disk was added or removed from the Kafka volumes, and a restart is needed to apply the change. When using StrimziPodSet resources, the same reason is given if the pod needs to be recreated.

PodHasOldRevision

PodHasOldGeneration

The StatefulSet or StrimziPodSet that the pod is a member of has been updated, so the pod needs to be recreated. When using StrimziPodSet resources, the same reason is given if a disk was added or removed from the Kafka volumes.

PodStuck

PodStuck

The pod is still pending, and is not scheduled or cannot be scheduled, so the operator has restarted the pod in a final attempt to get it running.

PodUnresponsive

PodUnresponsive

AMQ Streams was unable to connect to the pod, which can indicate a broker not starting correctly, so the operator restarted it in an attempt to resolve the issue.
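
The table differs between the two resource types in only two rows. As a minimal sketch based solely on the table above, the following shell function maps a reason reported with StatefulSet resources to the reason you would expect with StrimziPodSet resources. The function name to_podset_reason is hypothetical and is not part of AMQ Streams.

```shell
# Hypothetical helper: translate a StatefulSet restart reason into the
# reason reported when StrimziPodSet resources are used (per Table 19.1).
to_podset_reason() {
  case "$1" in
    # These two StatefulSet reasons both appear as PodHasOldRevision.
    JbodVolumesChanged|PodHasOldGeneration) echo "PodHasOldRevision" ;;
    # Every other reason in the table is the same for both resource types.
    *) echo "$1" ;;
  esac
}

to_podset_reason JbodVolumesChanged
to_podset_reason CaCertRenewed
```

A mapping like this might help when comparing events collected before and after a switch from StatefulSet to StrimziPodSet resources, because the reverse direction is ambiguous: PodHasOldRevision can stand for either of the two StatefulSet reasons.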

19.2. Restart event filters

When checking restart events from the command line, you can specify a field-selector to filter on OpenShift event fields.

The following fields are available when filtering events with field-selector.

regardingObject.kind
The kind of object that was restarted. For restart events, the kind is always Pod.
regarding.namespace
The namespace that the pod belongs to.
regardingObject.name
The pod’s name, for example, strimzi-cluster-kafka-0.
regardingObject.uid
The unique ID of the pod.
reason
The reason the pod was restarted, for example, JbodVolumesChanged.
reportingController
The reporting component is always strimzi.io/cluster-operator for AMQ Streams restart events.
source
source is an older version of reportingController. The reporting component is always strimzi.io/cluster-operator for AMQ Streams restart events.
type
The event type, which is either Warning or Normal. For AMQ Streams restart events, the type is Normal.
Note

In older versions of OpenShift, the fields using the regarding prefix might use an involvedObject prefix instead. reportingController was previously called reportingComponent.
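
Several fields can be combined in one field-selector by separating the conditions with commas. The following is a minimal sketch of assembling such a selector in a shell script. The helper name build_field_selector is hypothetical, and the kafka namespace is assumed from the examples in this chapter.

```shell
# Hypothetical helper: join field=value pairs with commas, which is the
# separator a field-selector expects between conditions.
# The body runs in a subshell so the IFS change does not leak out.
build_field_selector() (
  IFS=,
  printf '%s\n' "$*"
)

selector=$(build_field_selector \
  "reportingController=strimzi.io/cluster-operator" \
  "reason=JbodVolumesChanged" \
  "type=Normal")

# Print the command instead of running it, so the sketch works without a cluster.
echo "oc -n kafka get events --field-selector ${selector}"
```

All conditions in a field-selector must match for an event to be returned, so adding fields narrows the result.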

19.3. Checking Kafka restarts

Use an oc command to list restart events initiated by the Cluster Operator. To return only these events, set the Cluster Operator as the reporting component using the reportingController or source event fields.

Prerequisites

  • The Cluster Operator is running in the OpenShift cluster.

Procedure

  1. Get all restart events emitted by the Cluster Operator:

    oc -n kafka get events --field-selector reportingController=strimzi.io/cluster-operator

    Example showing events returned

    LAST SEEN   TYPE     REASON                   OBJECT                        MESSAGE
    2m          Normal   CaCertRenewed            pod/strimzi-cluster-kafka-0   CA certificate renewed
    58m         Normal   PodForceRestartOnError   pod/strimzi-cluster-kafka-1   Pod needs to be forcibly restarted due to an error
    5m47s       Normal   ManualRollingUpdate      pod/strimzi-cluster-kafka-2   Pod was manually annotated to be rolled

    You can also specify a reason or other field-selector options to constrain the events returned.

    Here, a specific reason is added:

    oc -n kafka get events --field-selector reportingController=strimzi.io/cluster-operator,reason=PodForceRestartOnError

  2. Use an output format, such as YAML, to return more detailed information about one or more events:

    oc -n kafka get events --field-selector reportingController=strimzi.io/cluster-operator,reason=PodForceRestartOnError -o yaml

    Example showing detailed events output

    apiVersion: v1
    items:
    - action: StrimziInitiatedPodRestart
      apiVersion: v1
      eventTime: "2022-05-13T00:22:34.168086Z"
      firstTimestamp: null
      involvedObject:
        kind: Pod
        name: strimzi-cluster-kafka-1
        namespace: kafka
      kind: Event
      lastTimestamp: null
      message: Pod needs to be forcibly restarted due to an error
      metadata:
        creationTimestamp: "2022-05-13T00:22:34Z"
        generateName: strimzi-event
        name: strimzi-eventwppk6
        namespace: kafka
        resourceVersion: "432961"
        uid: 29fcdb9e-f2cf-4c95-a165-a5efcd48edfc
      reason: PodForceRestartOnError
      reportingController: strimzi.io/cluster-operator
      reportingInstance: strimzi-cluster-operator-6458cfb4c6-6bpdp
      source: {}
      type: Normal
    kind: List
    metadata:
      resourceVersion: ""
      selfLink: ""

The following fields are deprecated, so they are not populated for these events:

  • firstTimestamp
  • lastTimestamp
  • source
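
To see which restart reasons occur most often, the table output returned in step 1 of the procedure can be tallied by reason. The following is a minimal sketch that assumes the default column layout shown in that example, where the reason is the third whitespace-separated field of each data row. The function name count_restart_reasons is hypothetical, and the input is the sample output from this chapter rather than live cluster data.

```shell
# Hypothetical helper: count events per reason from `oc get events` table
# output read on standard input. Assumes the default columns
# LAST SEEN, TYPE, REASON, OBJECT, MESSAGE.
count_restart_reasons() {
  # Skip the header row (NR > 1); field 3 of each data row is the reason.
  awk 'NR > 1 && NF >= 3 { count[$3]++ }
       END { for (reason in count) print count[reason], reason }' |
    # Highest count first, then alphabetical by reason.
    sort -k1,1nr -k2,2
}

# Sample input copied from the example output in the procedure.
count_restart_reasons <<'EOF'
LAST SEEN   TYPE     REASON                   OBJECT                        MESSAGE
2m          Normal   CaCertRenewed            pod/strimzi-cluster-kafka-0   CA certificate renewed
58m         Normal   PodForceRestartOnError   pod/strimzi-cluster-kafka-1   Pod needs to be forcibly restarted due to an error
5m47s       Normal   ManualRollingUpdate      pod/strimzi-cluster-kafka-2   Pod was manually annotated to be rolled
EOF
```

Against a running cluster, the same function could read from a pipe, for example by piping the oc command from step 1 into count_restart_reasons.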