Chapter 7. Setting up metrics and dashboards for AMQ Streams

You can monitor your AMQ Streams deployment by viewing key metrics on dashboards and setting up alerts that trigger under certain conditions. Metrics are available for Kafka, ZooKeeper, and the other components of AMQ Streams.

To provide metrics information, AMQ Streams uses Prometheus rules and Grafana dashboards.

When configured with a set of rules for each component of AMQ Streams, Prometheus consumes key metrics from the pods that are running in your cluster. Grafana then visualizes those metrics on dashboards. AMQ Streams includes example Grafana dashboards that you can customize to suit your deployment.

On OpenShift Container Platform 4.x, AMQ Streams employs monitoring for user-defined projects (an OpenShift feature) to simplify the Prometheus setup process.

On OpenShift Container Platform 3.11, you need to deploy the Prometheus and Alertmanager components to your cluster separately.

Regardless of your OpenShift Container Platform version, you have to start by deploying the Prometheus metrics configuration for AMQ Streams.

Next, follow the instructions for your OpenShift Container Platform version: Section 7.3, “Viewing Kafka metrics and dashboards in OpenShift 4” for OpenShift Container Platform 4.x, or Section 7.4, “Viewing Kafka metrics and dashboards in OpenShift 3.11” for OpenShift Container Platform 3.11.

With Prometheus and Grafana set up, you can use the example Grafana dashboards and alerting rules to monitor your Kafka cluster.

Additional monitoring options

Kafka Exporter is an optional component that provides additional monitoring related to consumer lag. If you want to use Kafka Exporter with AMQ Streams, see Configure the Kafka resource to deploy Kafka Exporter with your Kafka cluster.

You can also configure your deployment to track messages end-to-end by setting up distributed tracing. For more information, see Distributed tracing in the Using AMQ Streams on OpenShift guide.

Additional resources

7.1. Example metrics files

You can find example Grafana dashboards and other metrics configuration files in the examples/metrics directory. As indicated in the following list, some files are only used with OpenShift Container Platform 3.11, and not with OpenShift Container Platform 4.x.

Example metrics files provided with AMQ Streams

metrics
├── grafana-dashboards 1
│   ├── strimzi-cruise-control.json
│   ├── strimzi-kafka-bridge.json
│   ├── strimzi-kafka-connect.json
│   ├── strimzi-kafka-exporter.json
│   ├── strimzi-kafka-mirror-maker-2.json
│   ├── strimzi-kafka.json
│   ├── strimzi-operators.json
│   └── strimzi-zookeeper.json
├── grafana-install
│   └── grafana.yaml 2
├── prometheus-additional-properties
│   └── prometheus-additional.yaml - OPENSHIFT 3.11 ONLY 3
├── prometheus-alertmanager-config
│   └── alert-manager-config.yaml 4
├── prometheus-install
│   ├── alert-manager.yaml - OPENSHIFT 3.11 ONLY 5
│   ├── prometheus-rules.yaml 6
│   ├── prometheus.yaml - OPENSHIFT 3.11 ONLY 7
│   └── strimzi-pod-monitor.yaml 8
├── kafka-bridge-metrics.yaml 9
├── kafka-connect-metrics.yaml 10
├── kafka-cruise-control-metrics.yaml 11
├── kafka-metrics.yaml 12
└── kafka-mirror-maker-2-metrics.yaml 13

1
Example Grafana dashboards.
2
Installation file for the Grafana image.
3
OPENSHIFT 3.11 ONLY: Additional Prometheus configuration to scrape metrics for CPU, memory, and disk volume usage, which comes directly from the OpenShift cAdvisor agent and kubelet on the nodes.
4
Hook definitions for sending notifications through Alertmanager.
5
OPENSHIFT 3.11 ONLY: Resources for deploying and configuring Alertmanager.
6
Alerting rules examples for use with Prometheus Alertmanager.
7
OPENSHIFT 3.11 ONLY: Installation resource file for the Prometheus image.
8
PodMonitor definitions translated by the Prometheus Operator into jobs for the Prometheus server to be able to scrape metrics data directly from pods.
9
Kafka Bridge resource with metrics enabled.
10
Metrics configuration that defines Prometheus JMX Exporter relabeling rules for Kafka Connect.
11
Metrics configuration that defines Prometheus JMX Exporter relabeling rules for Cruise Control.
12
Metrics configuration that defines Prometheus JMX Exporter relabeling rules for Kafka and ZooKeeper.
13
Metrics configuration that defines Prometheus JMX Exporter relabeling rules for Kafka Mirror Maker 2.0.

7.1.1. Example Grafana dashboards

Example Grafana dashboards are provided for monitoring the following resources:

AMQ Streams Kafka

Shows metrics for:

  • Brokers online count
  • Active controllers in the cluster count
  • Unclean leader election rate
  • Replicas that are online
  • Under-replicated partitions count
  • Partitions which are at their minimum in sync replica count
  • Partitions which are under their minimum in sync replica count
  • Partitions that do not have an active leader and are hence not writable or readable
  • Kafka broker pods memory usage
  • Aggregated Kafka broker pods CPU usage
  • Kafka broker pods disk usage
  • JVM memory used
  • JVM garbage collection time
  • JVM garbage collection count
  • Total incoming byte rate
  • Total outgoing byte rate
  • Incoming messages rate
  • Total produce request rate
  • Byte rate
  • Produce request rate
  • Fetch request rate
  • Network processor average time idle percentage
  • Request handler average time idle percentage
  • Log size
AMQ Streams ZooKeeper

Shows metrics for:

  • Quorum size of the ZooKeeper ensemble
  • Number of alive connections
  • Queued requests in the server count
  • Watchers count
  • ZooKeeper pods memory usage
  • Aggregated ZooKeeper pods CPU usage
  • ZooKeeper pods disk usage
  • JVM memory used
  • JVM garbage collection time
  • JVM garbage collection count
  • Amount of time it takes for the server to respond to a client request (maximum, minimum and average)
AMQ Streams Kafka Connect

Shows metrics for:

  • Total incoming byte rate
  • Total outgoing byte rate
  • Disk usage
  • JVM memory used
  • JVM garbage collection time
AMQ Streams Kafka MirrorMaker 2

Shows metrics for:

  • Number of connectors
  • Number of tasks
  • Total incoming byte rate
  • Total outgoing byte rate
  • Disk usage
  • JVM memory used
  • JVM garbage collection time
AMQ Streams Operators

Shows metrics for:

  • Custom resources
  • Successful custom resource reconciliations per hour
  • Failed custom resource reconciliations per hour
  • Reconciliations without locks per hour
  • Reconciliations started per hour
  • Periodical reconciliations per hour
  • Maximum reconciliation time
  • Average reconciliation time
  • JVM memory used
  • JVM garbage collection time
  • JVM garbage collection count

Dashboards are also provided for the Kafka Bridge and Cruise Control components of AMQ Streams.

All the dashboards provide JVM metrics, as well as metrics that are specific to each component. For example, the Operators dashboard provides information on the number of reconciliations or custom resources that are being processed.

7.1.2. Example Prometheus metrics configuration

AMQ Streams uses the Prometheus JMX Exporter to expose JMX metrics using an HTTP endpoint, which is then scraped by Prometheus.

Grafana dashboards are dependent on Prometheus JMX Exporter relabeling rules, which are defined for AMQ Streams components as custom resource configuration.

A label is a name-value pair. Relabeling is the process of writing a label dynamically. For example, the value of a label might be derived from the name of a Kafka server and client ID.

AMQ Streams provides example custom resource configuration YAML files with the relabeling rules already defined. When deploying Prometheus metrics configuration, you can deploy the example custom resources or copy the metrics configuration to your own custom resource definitions.
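
To illustrate the rule format, the following is a minimal sketch of a Prometheus JMX Exporter relabeling rule in the style used by the example files. The pattern and metric name here are illustrative only; the actual rules are defined in the example YAML files listed below.

rules:
# Map a JMX MBean to a Prometheus metric and derive a label from the client ID (illustrative pattern)
- pattern: kafka.server<type=(.+), name=(.+), clientId=(.+)><>Value
  name: kafka_server_$1_$2
  type: GAUGE
  labels:
    clientId: "$3"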

Table 7.1. Example custom resources with metrics configuration

Component               Custom resource                     Example YAML file

Kafka and ZooKeeper     Kafka                               kafka-metrics.yaml
Kafka Connect           KafkaConnect and KafkaConnectS2I    kafka-connect-metrics.yaml
Kafka MirrorMaker 2.0   KafkaMirrorMaker2                   kafka-mirror-maker-2-metrics.yaml
Kafka Bridge            KafkaBridge                         kafka-bridge-metrics.yaml
Cruise Control          Kafka                               kafka-cruise-control-metrics.yaml

Additional resources

7.2. Deploying Prometheus metrics configuration

AMQ Streams provides example custom resource configuration YAML files with relabeling rules.

To apply the metrics configuration with relabeling rules, do one of the following: copy the example metrics configuration to your own custom resource (Section 7.2.1), or deploy an example Kafka cluster with the metrics configuration already included (Section 7.2.2).

7.2.1. Copying Prometheus metrics configuration to a custom resource

To use Grafana dashboards for monitoring, copy the example metrics configuration to a custom resource.

In this procedure, the Kafka resource is updated, but the procedure is the same for all components that support monitoring.

Procedure

Perform the following steps for each Kafka resource in your deployment.

  1. Update the Kafka resource in an editor.

    oc edit kafka KAFKA-CONFIG-FILE
  2. Copy the example configuration in kafka-metrics.yaml to your own Kafka resource definition (see the sketch following this procedure).
  3. Save the file, and wait for the updated resource to be reconciled.
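
A minimal sketch of the result is shown below, following the metricsConfig structure used by the example files. The ConfigMap name and keys are illustrative; kafka-metrics.yaml contains the full relabeling rules and the ConfigMap to copy.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  kafka:
    # ...
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics             # ConfigMap containing the JMX Exporter rules (illustrative name)
          key: kafka-metrics-config.yml
  zookeeper:
    # ...
    metricsConfig:
      type: jmxPrometheusExporter
      valueFrom:
        configMapKeyRef:
          name: kafka-metrics
          key: zookeeper-metrics-config.yml
  # ...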

7.2.2. Deploying a Kafka cluster with Prometheus metrics configuration

To use Grafana dashboards for monitoring, you can deploy an example Kafka cluster with metrics configuration.

In this procedure, the kafka-metrics.yaml file is used for the Kafka resource.

Procedure

7.3. Viewing Kafka metrics and dashboards in OpenShift 4

When AMQ Streams is deployed to OpenShift Container Platform 4.x, metrics are provided through monitoring for user-defined projects. This OpenShift feature gives developers access to a separate Prometheus instance for monitoring their own projects (for example, a Kafka project).

If monitoring for user-defined projects is enabled, the openshift-user-workload-monitoring project contains the following components:

  • A Prometheus Operator
  • A Prometheus instance (automatically deployed by the Prometheus Operator)
  • A Thanos Ruler instance

AMQ Streams uses these components to consume metrics.

A cluster administrator must enable monitoring for user-defined projects and then grant developers and other users permission to monitor applications within their own projects.
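
For reference, a cluster administrator typically enables monitoring for user-defined projects through the cluster-monitoring-config ConfigMap. The following is a sketch only; the exact mechanism varies between OpenShift Container Platform 4.x minor versions, so follow the OpenShift monitoring documentation for your version.

apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-monitoring-config
  namespace: openshift-monitoring
data:
  config.yaml: |
    enableUserWorkload: true   # enables the openshift-user-workload-monitoring components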

Grafana deployment

You can deploy a Grafana instance to the project containing your Kafka cluster. The example Grafana dashboards can then be used to visualize Prometheus metrics for AMQ Streams in the Grafana user interface.

Important

The openshift-monitoring project provides monitoring for core platform components. Do not use the Prometheus and Grafana components in this project to configure monitoring for AMQ Streams on OpenShift Container Platform 4.x.

Grafana version 6.3 is the minimum supported version.

Prerequisites

Procedure outline

To set up AMQ Streams monitoring in OpenShift Container Platform 4.x, follow the procedures in Section 7.3.1, “Deploying the Prometheus resources” through Section 7.3.5, “Importing the example Grafana dashboards”, in order.

7.3.1. Deploying the Prometheus resources

Note

Use this procedure when running AMQ Streams on OpenShift Container Platform 4.x.

To enable Prometheus to consume Kafka metrics, you configure and deploy the PodMonitor resources in the example metrics files. The PodMonitors scrape data directly from pods for Apache Kafka, ZooKeeper, Operators, the Kafka Bridge, and Cruise Control.

Then, you deploy the example alerting rules for Alertmanager.

Prerequisites

Procedure

  1. Check that monitoring for user-defined projects is enabled:

    oc get pods -n openshift-user-workload-monitoring

    If enabled, pods for the monitoring components are returned. For example:

    NAME                                   READY   STATUS    RESTARTS   AGE
    prometheus-operator-5cc59f9bc6-kgcq8   1/1     Running   0          25s
    prometheus-user-workload-0             5/5     Running   1          14s
    prometheus-user-workload-1             5/5     Running   1          14s
    thanos-ruler-user-workload-0           3/3     Running   0          14s
    thanos-ruler-user-workload-1           3/3     Running   0          14s

    If no pods are returned, monitoring for user-defined projects is disabled. See the Prerequisites in Section 7.3, “Viewing Kafka metrics and dashboards in OpenShift 4”.

  2. Multiple PodMonitor resources are defined in examples/metrics/prometheus-install/strimzi-pod-monitor.yaml.

    For each PodMonitor resource, edit the spec.namespaceSelector.matchNames property:

    apiVersion: monitoring.coreos.com/v1
    kind: PodMonitor
    metadata:
      name: cluster-operator-metrics
      labels:
        app: strimzi
    spec:
      selector:
        matchLabels:
          strimzi.io/kind: cluster-operator
      namespaceSelector:
        matchNames:
          - PROJECT-NAME 1
      podMetricsEndpoints:
      - path: /metrics
        port: http
    # ...
    1
    The project where the pods to scrape metrics from are running, for example, your Kafka project.
  3. Deploy the strimzi-pod-monitor.yaml file to the project where your Kafka cluster is running:

    oc apply -f strimzi-pod-monitor.yaml -n MY-PROJECT
  4. Deploy the example Prometheus rules to the same project:

    oc apply -f prometheus-rules.yaml -n MY-PROJECT

Additional resources

7.3.2. Creating a Service Account for Grafana

Note

Use this procedure when running AMQ Streams on OpenShift Container Platform 4.x.

Your Grafana instance for AMQ Streams needs to run with a Service Account that is assigned the cluster-monitoring-view role.

Procedure

  1. Create a ServiceAccount for Grafana. Here the resource is named grafana-serviceaccount.

    apiVersion: v1
    kind: ServiceAccount
    metadata:
      name: grafana-serviceaccount
      labels:
        app: strimzi
  2. Deploy the ServiceAccount to the project containing your Kafka cluster:

    oc apply -f GRAFANA-SERVICEACCOUNT -n MY-PROJECT
  3. Create a ClusterRoleBinding resource that assigns the cluster-monitoring-view role to the Grafana ServiceAccount. Here the resource is named grafana-cluster-monitoring-binding.

    apiVersion: rbac.authorization.k8s.io/v1
    kind: ClusterRoleBinding
    metadata:
      name: grafana-cluster-monitoring-binding
      labels:
        app: strimzi
    subjects:
      - kind: ServiceAccount
        name: grafana-serviceaccount
        namespace: MY-PROJECT 1
    roleRef:
      kind: ClusterRole
      name: cluster-monitoring-view
      apiGroup: rbac.authorization.k8s.io
    1
    Name of your project.
  4. Deploy the ClusterRoleBinding to the project containing your Kafka cluster:

    oc apply -f GRAFANA-CLUSTER-MONITORING-BINDING -n MY-PROJECT

7.3.3. Deploying Grafana with a Prometheus datasource

Note

Use this procedure when running AMQ Streams on OpenShift Container Platform 4.x.

This procedure describes how to deploy a Grafana application that is configured for the OpenShift Container Platform 4.x monitoring stack.

OpenShift Container Platform 4.x includes a Thanos Querier instance in the openshift-monitoring project. Thanos Querier is used to aggregate platform metrics.

To consume the required platform metrics, your Grafana instance requires a Prometheus data source that can connect to Thanos Querier. To configure this connection, you create a Config Map that uses a token to authenticate to the oauth-proxy sidecar that runs alongside Thanos Querier. A datasource.yaml file is used as the source of the Config Map.

Finally, you deploy the Grafana application with the Config Map mounted as a volume to the project containing your Kafka cluster.

Procedure

  1. Get the access token of the Grafana ServiceAccount:

    oc serviceaccounts get-token grafana-serviceaccount -n MY-PROJECT

    Copy the access token to use in the next step.

  2. Create a datasource.yaml file containing the Thanos Querier configuration for Grafana.

    Paste the access token into the httpHeaderValue1 property as indicated.

    apiVersion: 1
    
    datasources:
    - name: Prometheus
      type: prometheus
      url: https://thanos-querier.openshift-monitoring.svc.cluster.local:9091
      access: proxy
      basicAuth: false
      withCredentials: false
      isDefault: true
      jsonData:
        timeInterval: 5s
        tlsSkipVerify: true
        httpHeaderName1: "Authorization"
      secureJsonData:
        httpHeaderValue1: "Bearer ${GRAFANA-ACCESS-TOKEN}" 1
      editable: true
    1
    GRAFANA-ACCESS-TOKEN: The value of the access token for the Grafana ServiceAccount.
  3. Create a Config Map named grafana-config from the datasource.yaml file:

    oc create configmap grafana-config --from-file=datasource.yaml -n MY-PROJECT
  4. Create a Grafana application consisting of a Deployment and a Service.

    The grafana-config Config Map is mounted as a volume for the datasource configuration.

    apiVersion: apps/v1
    kind: Deployment
    metadata:
      name: grafana
      labels:
        app: strimzi
    spec:
      replicas: 1
      selector:
        matchLabels:
          name: grafana
      template:
        metadata:
          labels:
            name: grafana
        spec:
          serviceAccountName: grafana-serviceaccount
          containers:
          - name: grafana
            image: grafana/grafana:6.3.0
            ports:
            - name: grafana
              containerPort: 3000
              protocol: TCP
            volumeMounts:
            - name: grafana-data
              mountPath: /var/lib/grafana
            - name: grafana-logs
              mountPath: /var/log/grafana
            - name: grafana-config
              mountPath: /etc/grafana/provisioning/datasources/datasource.yaml
              readOnly: true
              subPath: datasource.yaml
            readinessProbe:
              httpGet:
                path: /api/health
                port: 3000
              initialDelaySeconds: 5
              periodSeconds: 10
            livenessProbe:
              httpGet:
                path: /api/health
                port: 3000
              initialDelaySeconds: 15
              periodSeconds: 20
          volumes:
          - name: grafana-data
            emptyDir: {}
          - name: grafana-logs
            emptyDir: {}
          - name: grafana-config
            configMap:
              name: grafana-config
    ---
    apiVersion: v1
    kind: Service
    metadata:
      name: grafana
      labels:
        app: strimzi
    spec:
      ports:
      - name: grafana
        port: 3000
        targetPort: 3000
        protocol: TCP
      selector:
        name: grafana
      type: ClusterIP
  5. Deploy the Grafana application to the project containing your Kafka cluster:

    oc apply -f GRAFANA-APPLICATION -n MY-PROJECT

Additional resources

7.3.4. Creating a Route to the Grafana Service

Note

Use this procedure when running AMQ Streams on OpenShift Container Platform 4.x.

You can access the Grafana user interface through a Route that exposes the Grafana Service.

Procedure

  • Create an edge route to the grafana service:

    oc create route edge MY-GRAFANA-ROUTE --service=grafana --namespace=KAFKA-NAMESPACE

7.3.5. Importing the example Grafana dashboards

Note

Use this procedure when running AMQ Streams on OpenShift Container Platform 4.x.

Import the example Grafana dashboards using the Grafana user interface.

Procedure

  1. Get the details of the Route to the Grafana Service. For example:

    oc get routes
    
    NAME               HOST/PORT                         PATH  SERVICES
    MY-GRAFANA-ROUTE   MY-GRAFANA-ROUTE-amq-streams.net        grafana
  2. In a web browser, access the Grafana login screen using the URL for the Route host and port.
  3. Enter your user name and password, and then click Log In.

    The default Grafana user name and password are both admin. After logging in for the first time, you can change the password.

  4. In Configuration > Data Sources, check that the Prometheus data source was created. The data source was created in Section 7.3.3, “Deploying Grafana with a Prometheus datasource”.
  5. Click Dashboards > Manage, and then click Import.
  6. In examples/metrics/grafana-dashboards, copy the JSON of the dashboard to import.
  7. Paste the JSON into the text box, and then click Load.
  8. Repeat steps 1-7 for the other example Grafana dashboards.

The imported Grafana dashboards are available to view from the Dashboards home page.

7.4. Viewing Kafka metrics and dashboards in OpenShift 3.11

When AMQ Streams is deployed to OpenShift Container Platform 3.11, you can use Prometheus to provide monitoring data for the example Grafana dashboards provided with AMQ Streams. You need to manually deploy the Prometheus components to your cluster.

In order to run the example Grafana dashboards, you must set up Prometheus (Section 7.4.2), set up Prometheus Alertmanager (Section 7.4.3), and set up Grafana (Section 7.4.4).

Note

The resources referenced in this section are intended as a starting point for setting up monitoring, but they are provided as examples only. If you require further support on configuring and running Prometheus or Grafana in production, try reaching out to their respective communities.

7.4.1. Prometheus support

The Prometheus server is not supported when AMQ Streams is deployed to OpenShift Container Platform 3.11. However, the Prometheus endpoint and the Prometheus JMX Exporter used to expose the metrics are supported.

For your convenience, we supply detailed instructions and example metrics configuration files should you wish to use Prometheus for monitoring.

7.4.2. Setting up Prometheus

Note

Use these procedures when running AMQ Streams on OpenShift Container Platform 3.11.

Prometheus provides an open source set of components for systems monitoring and alert notification.

Here we describe how to use the provided Prometheus image and configuration files to run and manage a Prometheus server when AMQ Streams is deployed to OpenShift Container Platform 3.11.

Prerequisites

  • You have deployed compatible versions of Prometheus and Grafana to your OpenShift Container Platform 3.11 cluster.
  • The service account used for running the Prometheus server pod has access to the OpenShift API server. This allows the service account to retrieve the list of pods in the cluster from which it gets metrics.

    For more information, see Discovering services.

7.4.2.1. Prometheus configuration

AMQ Streams provides example configuration files for the Prometheus server.

A Prometheus image is provided for deployment:

  • prometheus.yaml

Additional Prometheus-related configuration is also provided in the following files:

  • prometheus-additional.yaml
  • prometheus-rules.yaml
  • strimzi-pod-monitor.yaml

For Prometheus to obtain monitoring data, you must have deployed a compatible version of Prometheus to your OpenShift Container Platform 3.11 cluster.

Then, use the configuration files to deploy Prometheus, as described in Section 7.4.2.3, “Deploying Prometheus”.

7.4.2.2. Prometheus resources

When you apply the Prometheus configuration, the following resources are created in your OpenShift cluster and managed by the Prometheus Operator:

  • A ClusterRole that grants permissions to Prometheus to read the health endpoints exposed by the Kafka and ZooKeeper pods, cAdvisor and the kubelet for container metrics.
  • A ServiceAccount for the Prometheus pods to run under.
  • A ClusterRoleBinding which binds the ClusterRole to the ServiceAccount.
  • A Deployment to manage the Prometheus Operator pod.
  • A PodMonitor to manage the configuration of the Prometheus pod.
  • A Prometheus to manage the configuration of the Prometheus pod.
  • A PrometheusRule to manage alerting rules for the Prometheus pod.
  • A Secret to manage additional Prometheus settings.
  • A Service to allow applications running in the cluster to connect to Prometheus (for example, Grafana using Prometheus as datasource).

7.4.2.3. Deploying Prometheus

To obtain monitoring data in your Kafka cluster, you can use your own Prometheus deployment or deploy Prometheus by applying the example installation resource file for the Prometheus docker image and the YAML files for Prometheus-related resources.

The deployment process creates a ClusterRoleBinding and discovers an Alertmanager instance in the namespace specified for the deployment.

Prerequisites

Procedure

  1. Modify the Prometheus installation file (prometheus.yaml) according to the namespace Prometheus is going to be installed into:

    On Linux, use:

    sed -i 's/namespace: .*/namespace: my-namespace/' prometheus.yaml

    On MacOS, use:

    sed -i '' 's/namespace: .*/namespace: my-namespace/' prometheus.yaml
  2. Edit the PodMonitor resource in strimzi-pod-monitor.yaml to define Prometheus jobs that will scrape the metrics data from pods.

    Update the namespaceSelector.matchNames property with the namespace where the pods to scrape the metrics from are running.

    PodMonitor is used to scrape data directly from pods for Apache Kafka, ZooKeeper, Operators, the Kafka Bridge and Cruise Control.

  3. Edit the prometheus.yaml installation file to include additional configuration for scraping metrics directly from nodes.

    The Grafana dashboards provided show metrics for CPU, memory and disk volume usage, which come directly from the OpenShift cAdvisor agent and kubelet on the nodes.

    1. Create a Secret resource from the configuration file (prometheus-additional.yaml in the examples/metrics/prometheus-additional-properties directory):

      oc apply -f prometheus-additional.yaml
    2. Edit the additionalScrapeConfigs property in the prometheus.yaml file to include the name of the Secret and the prometheus-additional.yaml file (see the sketch following this procedure).
  4. Deploy the Prometheus resources:

    oc apply -f strimzi-pod-monitor.yaml
    oc apply -f prometheus-rules.yaml
    oc apply -f prometheus.yaml
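
The additionalScrapeConfigs property references the Secret by name and key. The sketch below assumes the Secret created from prometheus-additional.yaml is named additional-scrape-configs; check the metadata of the Secret you created and the Prometheus resource in prometheus.yaml for the actual values.

apiVersion: monitoring.coreos.com/v1
kind: Prometheus
metadata:
  name: prometheus
spec:
  # ...
  additionalScrapeConfigs:
    name: additional-scrape-configs    # name of the Secret (assumed)
    key: prometheus-additional.yaml    # key within the Secret
  # ...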

7.4.3. Setting up Prometheus Alertmanager

Prometheus Alertmanager is a plugin for handling alerts and routing them to a notification service. Alertmanager supports an essential aspect of monitoring, which is to be notified of conditions that indicate potential issues based on alerting rules.

7.4.3.1. Alertmanager configuration

AMQ Streams provides example configuration files for Prometheus Alertmanager.

A configuration file defines the resources for deploying Alertmanager:

  • alert-manager.yaml

An additional configuration file provides the hook definitions for sending notifications from your Kafka cluster.

  • alert-manager-config.yaml

For Alertmanager to handle Prometheus alerts, use the configuration files to deploy Alertmanager, as described in Section 7.4.3.4, “Deploying Alertmanager”.

7.4.3.2. Alerting rules

Alerting rules provide notifications about specific conditions observed in the metrics. Rules are declared on the Prometheus server, but Prometheus Alertmanager is responsible for alert notifications.

Prometheus alerting rules describe conditions using PromQL expressions that are continuously evaluated.

When an alert expression becomes true, the condition is met and the Prometheus server sends alert data to the Alertmanager. Alertmanager then sends out a notification using the communication method configured for its deployment.

Alertmanager can be configured to use email, chat messages or other notification methods.
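
To illustrate the rule format, the following sketch shows an alert in the style of the rules shipped in prometheus-rules.yaml. The metric name, threshold, and duration here are illustrative; refer to the example file for the exact definitions.

apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: prometheus-rules
  labels:
    role: alert-rules
    app: strimzi
spec:
  groups:
  - name: kafka
    rules:
    - alert: UnderReplicatedPartitions
      expr: kafka_server_replicamanager_underreplicatedpartitions > 0   # illustrative metric name and threshold
      for: 10s
      labels:
        severity: warning
      annotations:
        summary: 'Kafka under-replicated partitions'
        description: 'There are {{ $value }} under-replicated partitions on {{ $labels.kubernetes_pod_name }}'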

Additional resources

For more information about setting up alerting rules, see Configuration in the Prometheus documentation.

7.4.3.3. Alerting rule examples

Example alerting rules for Kafka and ZooKeeper metrics are provided with AMQ Streams for use in a Prometheus deployment.

General points about the alerting rule definitions:

  • A for property is used with the rules to determine the period of time a condition must persist before an alert is triggered.
  • A tick is a basic ZooKeeper time unit, which is measured in milliseconds and configured using the tickTime parameter of Kafka.spec.zookeeper.config. For example, if ZooKeeper tickTime=3000, 3 ticks (3 x 3000) equals 9000 milliseconds (see the configuration sketch after this list).
  • The availability of the ZookeeperRunningOutOfSpace metric and alert is dependent on the OpenShift configuration and storage implementation used. Storage implementations for certain platforms may not be able to supply the information on available space required for the metric to provide an alert.
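
For example, the ZooKeeper tick length can be set in the Kafka custom resource as follows. This is a minimal sketch showing only the relevant property, assuming a cluster named my-cluster.

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ...
  zookeeper:
    # ...
    config:
      tickTime: 3000   # 1 tick = 3000 ms, so 3 ticks = 9000 ms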

Kafka alerting rules

UnderReplicatedPartitions
Gives the number of partitions for which the current broker is the lead replica but which have fewer replicas than the min.insync.replicas configured for their topic. This metric provides insights about brokers that host the follower replicas. Those followers are not keeping up with the leader. Reasons for this could include being (or having been) offline, and over-throttled interbroker replication. An alert is raised when this value is greater than zero, providing information on the under-replicated partitions for each broker.
AbnormalControllerState
Indicates whether the current broker is the controller for the cluster. The metric can be 0 or 1. During the life of a cluster, only one broker should be the controller and the cluster always needs to have an active controller. Having two or more brokers saying that they are controllers indicates a problem. If the condition persists, an alert is raised when the sum of all the values for this metric on all brokers is not equal to 1, meaning that there is no active controller (the sum is 0) or more than one controller (the sum is greater than 1).
UnderMinIsrPartitionCount
Indicates that the minimum number of in-sync replicas (ISRs) for a lead Kafka broker, specified using min.insync.replicas, that must acknowledge a write operation has not been reached. The metric defines the number of partitions that the broker leads for which the in-sync replicas count is less than the minimum in-sync. An alert is raised when this value is greater than zero, providing information on the partition count for each broker that did not achieve the minimum number of acknowledgments.
OfflineLogDirectoryCount
Indicates the number of log directories which are offline (for example, due to a hardware failure) so that the broker cannot store incoming messages anymore. An alert is raised when this value is greater than zero, providing information on the number of offline log directories for each broker.
KafkaRunningOutOfSpace
Indicates the remaining amount of disk space that can be used for writing data. An alert is raised when this value is lower than 5GiB, providing information on the disk that is running out of space for each persistent volume claim. The threshold value may be changed in prometheus-rules.yaml.

ZooKeeper alerting rules

AvgRequestLatency
Indicates the amount of time it takes for the server to respond to a client request. An alert is raised when this value is greater than 10 (ticks), providing the actual value of the average request latency for each server.
OutstandingRequests
Indicates the number of queued requests in the server. This value goes up when the server receives more requests than it can process. An alert is raised when this value is greater than 10, providing the actual number of outstanding requests for each server.
ZookeeperRunningOutOfSpace
Indicates the remaining amount of disk space that can be used for writing data to ZooKeeper. An alert is raised when this value is lower than 5GiB, providing information on the disk that is running out of space for each persistent volume claim.

7.4.3.4. Deploying Alertmanager

To deploy Alertmanager, apply the example configuration files.

The sample configuration provided with AMQ Streams configures the Alertmanager to send notifications to a Slack channel.

The following resources are defined on deployment:

  • An Alertmanager to manage the Alertmanager pod.
  • A Secret to manage the configuration of the Alertmanager.
  • A Service to provide an easy to reference hostname for other services to connect to Alertmanager (such as Prometheus).

Procedure

  1. Create a Secret resource from the Alertmanager configuration file (alert-manager-config.yaml in the examples/metrics/prometheus-alertmanager-config directory):

    oc apply -f alert-manager-config.yaml
  2. Update the alert-manager-config.yaml file to replace the following (see the sketch following this procedure):

    • slack_api_url property with the actual value of the Slack API URL related to the application for the Slack workspace
    • channel property with the actual Slack channel on which to send notifications
  3. Deploy Alertmanager:

    oc apply -f alert-manager.yaml
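
For orientation, the sketch below shows the general shape of an Alertmanager configuration with the Slack properties that step 2 asks you to replace. It is illustrative only; keep the structure of the alert-manager-config.yaml file provided in the examples.

global:
  slack_api_url: https://hooks.slack.com/services/REPLACE/WITH/YOUR-URL   # Slack API URL for your workspace
route:
  receiver: slack
receivers:
- name: slack
  slack_configs:
  - channel: "#my-alerts-channel"   # Slack channel to receive notifications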

7.4.4. Setting up Grafana

Grafana provides visualizations of Prometheus metrics.

You can deploy and enable the example Grafana dashboards provided with AMQ Streams.

7.4.4.1. Deploying Grafana

To provide visualizations of Prometheus metrics, you can use your own Grafana installation or deploy Grafana by applying the grafana.yaml file provided in the examples/metrics directory.

Procedure

  1. Deploy Grafana:

    oc apply -f grafana.yaml
  2. Enable the Grafana dashboards.

7.4.4.2. Enabling the example Grafana dashboards

AMQ Streams provides example dashboard configuration files for Grafana. Example dashboards are provided in the examples/metrics/grafana-dashboards directory as JSON files:

  • strimzi-kafka.json
  • strimzi-zookeeper.json
  • strimzi-operators.json
  • strimzi-kafka-connect.json
  • strimzi-kafka-mirror-maker-2.json
  • strimzi-kafka-bridge.json
  • strimzi-cruise-control.json
  • strimzi-kafka-exporter.json

The example dashboards are a good starting point for monitoring key metrics, but they do not represent all available metrics. You can modify the example dashboards or add other metrics, depending on your infrastructure.

After setting up Prometheus and Grafana, you can visualize the AMQ Streams data on the Grafana dashboards.

Note

No alert notification rules are defined.

When accessing a dashboard, you can use the port-forward command to forward traffic from the Grafana pod to the host.

Note

The name of the Grafana pod is different for each user.

Procedure

  1. Get the details of the Grafana service:

    oc get service grafana

    For example:

    NAME      TYPE        CLUSTER-IP      PORT(S)
    grafana   ClusterIP   172.30.123.40   3000/TCP

    Note the port number for port forwarding.

  2. Use port-forward to redirect the Grafana user interface to localhost:3000:

    oc port-forward svc/grafana 3000:3000
  3. Point a web browser to http://localhost:3000.

    The Grafana Log In page appears.

  4. Enter your user name and password, and then click Log In.

    The default Grafana user name and password are both admin. After logging in for the first time, you can change the password.

  5. Add Prometheus as a data source.

    • Specify a name
    • Add Prometheus as the type
    • Specify a Prometheus server URL (http://prometheus-operated:9090)

      Save and test the connection when you have added the details.

  6. From Dashboards > Import, upload the example dashboards or paste the JSON directly.
  7. On the top header, click the dashboard drop-down menu, and then select the dashboard you want to view.

    When the Prometheus server has been collecting metrics for a AMQ Streams cluster for some time, the dashboards are populated.

Figure 7.1. Dashboard selection options

AMQ Streams Kafka

Shows metrics for:

  • Brokers online count
  • Active controllers in the cluster count
  • Unclean leader election rate
  • Replicas that are online
  • Under-replicated partitions count
  • Partitions which are at their minimum in sync replica count
  • Partitions which are under their minimum in sync replica count
  • Partitions that do not have an active leader and are hence not writable or readable
  • Kafka broker pods memory usage
  • Aggregated Kafka broker pods CPU usage
  • Kafka broker pods disk usage
  • JVM memory used
  • JVM garbage collection time
  • JVM garbage collection count
  • Total incoming byte rate
  • Total outgoing byte rate
  • Incoming messages rate
  • Total produce request rate
  • Byte rate
  • Produce request rate
  • Fetch request rate
  • Network processor average time idle percentage
  • Request handler average time idle percentage
  • Log size

    Figure 7.2. AMQ Streams Kafka dashboard

AMQ Streams ZooKeeper

Shows metrics for:

  • Quorum size of the ZooKeeper ensemble
  • Number of alive connections
  • Queued requests in the server count
  • Watchers count
  • ZooKeeper pods memory usage
  • Aggregated ZooKeeper pods CPU usage
  • ZooKeeper pods disk usage
  • JVM memory used
  • JVM garbage collection time
  • JVM garbage collection count
  • Amount of time it takes for the server to respond to a client request (maximum, minimum and average)
AMQ Streams Operators

Shows metrics for:

  • Custom resources
  • Successful custom resource reconciliations per hour
  • Failed custom resource reconciliations per hour
  • Reconciliations without locks per hour
  • Reconciliations started per hour
  • Periodical reconciliations per hour
  • Maximum reconciliation time
  • Average reconciliation time
  • JVM memory used
  • JVM garbage collection time
  • JVM garbage collection count
AMQ Streams Kafka Connect

Shows metrics for:

  • Total incoming byte rate
  • Total outgoing byte rate
  • Disk usage
  • JVM memory used
  • JVM garbage collection time
AMQ Streams Kafka MirrorMaker 2

Shows metrics for:

  • Number of connectors
  • Number of tasks
  • Total incoming byte rate
  • Total outgoing byte rate
  • Disk usage
  • JVM memory used
  • JVM garbage collection time
AMQ Streams Kafka Bridge
See Section 7.6, “Monitor Kafka Bridge”.
AMQ Streams Cruise Control
See Section 7.7, “Monitor Cruise Control”.
AMQ Streams Kafka Exporter
See Section 7.5.5, “Enabling the Kafka Exporter Grafana dashboard”.

7.5. Add Kafka Exporter

Kafka Exporter is an open source project to enhance monitoring of Apache Kafka brokers and clients. Kafka Exporter is provided with AMQ Streams for deployment with a Kafka cluster to extract additional metrics data from Kafka brokers related to offsets, consumer groups, consumer lag, and topics.

The metrics data is used, for example, to help identify slow consumers.

Lag data is exposed as Prometheus metrics, which can then be presented in Grafana for analysis.

If you are already using Prometheus and Grafana for monitoring of built-in Kafka metrics, you can configure Prometheus to also scrape the Kafka Exporter Prometheus endpoint.

AMQ Streams includes an example Kafka Exporter dashboard in examples/metrics/grafana-dashboards/strimzi-kafka-exporter.json.

7.5.1. Monitoring Consumer lag

Consumer lag indicates the difference in the rate of production and consumption of messages. Specifically, consumer lag for a given consumer group indicates the delay between the last message in the partition and the message being currently picked up by that consumer.

The lag reflects the position of the consumer offset in relation to the end of the partition log.

Consumer lag between the producer and consumer offset


This difference is sometimes referred to as the delta between the producer offset and consumer offset: the read and write positions in the Kafka broker topic partitions.

Suppose a topic streams 100 messages a second. A lag of 1000 messages between the producer offset (the topic partition head) and the last offset the consumer has read means a 10-second delay.

The importance of monitoring consumer lag

For applications that rely on the processing of (near) real-time data, it is critical to monitor consumer lag to check that it does not become too big. The greater the lag becomes, the further the process moves from the real-time processing objective.

Consumer lag, for example, might be a result of consuming too much old data that has not been purged, or through unplanned shutdowns.

Reducing consumer lag

Typical actions to reduce lag include:

  • Scaling-up consumer groups by adding new consumers
  • Increasing the retention time for a message to remain in a topic
  • Adding more disk capacity to increase the message buffer

Actions to reduce consumer lag depend on the underlying infrastructure and the use cases AMQ Streams is supporting. For instance, a lagging consumer is less likely to benefit from the broker being able to service a fetch request from its disk cache. And in certain cases, it might be acceptable to automatically drop messages until a consumer has caught up.
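
For example, if topics are managed through the Topic Operator, the retention time can be increased in a KafkaTopic resource. This is a sketch only, assuming a topic named my-topic in the cluster my-cluster.

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaTopic
metadata:
  name: my-topic
  labels:
    strimzi.io/cluster: my-cluster
spec:
  partitions: 3
  replicas: 3
  config:
    retention.ms: 604800000   # retain messages for 7 days (in milliseconds)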

7.5.2. Example Kafka Exporter alerting rules

If you performed the steps to introduce metrics to your deployment, you will already have your Kafka cluster configured to use the alert notification rules that support Kafka Exporter.

The rules for Kafka Exporter are defined in prometheus-rules.yaml, and are deployed with Prometheus. For more information, see Prometheus.

The sample alert notification rules specific to Kafka Exporter are as follows:

UnderReplicatedPartition
An alert to warn that a topic is under-replicated and the broker is not replicating to enough partitions. The default configuration is for an alert if there are one or more under-replicated partitions for a topic. The alert might signify that a Kafka instance is down or the Kafka cluster is overloaded. A planned restart of the Kafka broker may be required to restart the replication process.
TooLargeConsumerGroupLag
An alert to warn that the lag on a consumer group is too large for a specific topic partition. The default configuration is 1000 records. A large lag might indicate that consumers are too slow and are falling behind the producers.
NoMessageForTooLong
An alert to warn that a topic has not received messages for a period of time. The default configuration for the time period is 10 minutes. The delay might be a result of a configuration issue preventing a producer from publishing messages to the topic.

Adapt the default configuration of these rules according to your specific needs.
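
For instance, to raise the lag threshold for TooLargeConsumerGroupLag, you would edit its expression in prometheus-rules.yaml before deploying the rules. The expression and duration below are a sketch; check the file for the exact metric labels and rule structure used.

- alert: TooLargeConsumerGroupLag
  expr: kafka_consumergroup_lag > 5000   # raised from the default threshold of 1000 records
  for: 5m
  labels:
    severity: warning

After editing, redeploy the rules with oc apply -f prometheus-rules.yaml.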

7.5.3. Exposing Kafka Exporter metrics

Lag information is exposed by Kafka Exporter as Prometheus metrics for presentation in Grafana.

Kafka Exporter exposes metrics data for brokers, topics and consumer groups. These metrics are displayed on the example strimzi-kafka-exporter dashboard.

The data extracted is described here.

Table 7.2. Broker metrics output

Name            Information

kafka_brokers   Number of brokers in the Kafka cluster

Table 7.3. Topic metrics output

Name                                               Information

kafka_topic_partitions                             Number of partitions for a topic
kafka_topic_partition_current_offset               Current topic partition offset for a broker
kafka_topic_partition_oldest_offset                Oldest topic partition offset for a broker
kafka_topic_partition_in_sync_replica              Number of in-sync replicas for a topic partition
kafka_topic_partition_leader                       Leader broker ID of a topic partition
kafka_topic_partition_leader_is_preferred          Shows 1 if a topic partition is using the preferred broker
kafka_topic_partition_replicas                     Number of replicas for this topic partition
kafka_topic_partition_under_replicated_partition   Shows 1 if a topic partition is under-replicated

Table 7.4. Consumer group metrics output

Name                                 Information

kafka_consumergroup_current_offset   Current topic partition offset for a consumer group
kafka_consumergroup_lag              Current approximate lag for a consumer group at a topic partition

Consumer group metrics are only displayed on the Kafka Exporter dashboard if at least one consumer group has a lag greater than zero.

7.5.4. Configuring Kafka Exporter

This procedure shows how to configure Kafka Exporter in the Kafka resource through KafkaExporter properties.

For more information about configuring the Kafka resource, see Kafka cluster configuration in the Using AMQ Streams on OpenShift guide.

The properties relevant to the Kafka Exporter configuration are shown in this procedure.

You can configure these properties as part of a deployment or redeployment of the Kafka cluster.

Prerequisites

  • An OpenShift cluster
  • A running Cluster Operator

Procedure

  1. Edit the KafkaExporter properties for the Kafka resource.

    The properties you can configure are shown in this example configuration:

    apiVersion: kafka.strimzi.io/v1beta2
    kind: Kafka
    metadata:
      name: my-cluster
    spec:
      # ...
      kafkaExporter:
        image: my-registry.io/my-org/my-exporter-cluster:latest 1
        groupRegex: ".*" 2
        topicRegex: ".*" 3
        resources: 4
          requests:
            cpu: 200m
            memory: 64Mi
          limits:
            cpu: 500m
            memory: 128Mi
        logging: debug 5
        enableSaramaLogging: true 6
        template: 7
          pod:
            metadata:
              labels:
                label1: value1
            imagePullSecrets:
              - name: my-docker-credentials
            securityContext:
              runAsUser: 1000001
              fsGroup: 0
            terminationGracePeriodSeconds: 120
        readinessProbe: 8
          initialDelaySeconds: 15
          timeoutSeconds: 5
        livenessProbe: 9
          initialDelaySeconds: 15
          timeoutSeconds: 5
    # ...
    1
    ADVANCED OPTION: Container image configuration, which is recommended only in special situations.
    2
    A regular expression to specify the consumer groups to include in the metrics.
    3
    A regular expression to specify the topics to include in the metrics.
    4
    CPU and memory resource requests and limits reserved for Kafka Exporter.
    5
    Logging configuration, to log messages with a given severity (debug, info, warn, error, fatal) or above.
    6
    Boolean to enable Sarama logging, a Go client library used by Kafka Exporter.
    7
    Customization of deployment templates and pods, such as labels, image pull secrets, security context, and termination grace period.
    8
    Healthcheck readiness probe configuration (initial delay and timeout).
    9
    Healthcheck liveness probe configuration (initial delay and timeout).
  2. Create or update the resource:

    oc apply -f kafka.yaml

What to do next

After configuring and deploying Kafka Exporter, you can enable Grafana to present the Kafka Exporter dashboards.

7.5.5. Enabling the Kafka Exporter Grafana dashboard

AMQ Streams provides example dashboard configuration files for Grafana. The Kafka Exporter dashboard is provided in the examples/metrics directory as a JSON file:

  • strimzi-kafka-exporter.json

If you deployed Kafka Exporter with your Kafka cluster, you can visualize the metrics data it exposes on the Grafana dashboard.

This procedure assumes you already have access to the Grafana user interface and Prometheus has been added as a data source. If you are accessing the user interface for the first time, see Grafana.

Procedure

  1. Access the Grafana user interface.
  2. Select the Strimzi Kafka Exporter dashboard.

    When metrics data has been collected for some time, the Kafka Exporter charts are populated.

    AMQ Streams Kafka Exporter

    Shows metrics for:

    • Topic count
    • Partition count
    • Replicas count
    • In-sync replicas count
    • Under-replicated partitions count
    • Partitions which are at their minimum in sync replica count
    • Partitions which are under their minimum in sync replica count
    • Partitions not on a preferred node
    • Messages in per second from topics
    • Messages consumed per second from topics
    • Messages consumed per minute by consumer groups
    • Lag by consumer group
    • Number of partitions
    • Latest offsets
    • Oldest offsets

Use the Grafana charts to analyze lag and to check if actions to reduce lag are having an impact on an affected consumer group. If, for example, Kafka brokers are adjusted to reduce lag, the dashboard will show the Lag by consumer group chart going down and the Messages consumed per minute chart going up.

7.6. Monitor Kafka Bridge

If you are already using Prometheus and Grafana for monitoring of built-in Kafka metrics, you can configure Prometheus to also scrape the Kafka Bridge Prometheus endpoint.

The example Grafana dashboard for the Kafka Bridge provides:

  • Information about HTTP connections and related requests to the different endpoints
  • Information about the Kafka consumers and producers used by the bridge
  • JVM metrics from the bridge itself

7.6.1. Configuring Kafka Bridge

You can enable the Kafka Bridge metrics in the KafkaBridge resource using the enableMetrics property.

You can configure this property as part of a deployment or redeployment of the Kafka Bridge.

For example:

apiVersion: kafka.strimzi.io/v1beta2
kind: KafkaBridge
metadata:
  name: my-bridge
spec:
  # ...
  bootstrapServers: my-cluster-kafka:9092
  http:
    # ...
  enableMetrics: true
  # ...

7.6.2. Enabling the Kafka Bridge Grafana dashboard

If you deployed Kafka Bridge with your Kafka cluster, you can enable Grafana to present the metrics data it exposes.

A Kafka Bridge dashboard is provided in the examples/metrics directory as a JSON file:

  • strimzi-kafka-bridge.json

When metrics data has been collected for some time, the Kafka Bridge charts are populated.

Kafka Bridge

Shows metrics for:

  • HTTP connections to the Kafka Bridge count
  • HTTP requests being processed count
  • Requests processed per second grouped by HTTP method
  • The total request rate grouped by response codes (2XX, 4XX, 5XX)
  • Bytes received and sent per second
  • Requests for each Kafka Bridge endpoint
  • Number of Kafka consumers, producers, and related opened connections used by the Kafka Bridge itself
  • Kafka producer:

    • The average number of records sent per second (grouped by topic)
    • The number of outgoing bytes sent to all brokers per second (grouped by topic)
    • The average number of records per second that resulted in errors (grouped by topic)
  • Kafka consumer:

    • The average number of records consumed per second (grouped by clientId-topic)
    • The average number of bytes consumed per second (grouped by clientId-topic)
    • Partitions assigned (grouped by clientId)
  • JVM memory used
  • JVM garbage collection time
  • JVM garbage collection count

7.7. Monitor Cruise Control

If you are already using Prometheus and Grafana for monitoring of built-in Kafka metrics, you can configure Prometheus to also scrape the Cruise Control Prometheus endpoint.

The example Grafana dashboard for Cruise Control provides:

  • Information about optimization proposals computation, goals violation, cluster balancedness, and more
  • Information about REST API calls for rebalance proposals and actual rebalance operations
  • JVM metrics from Cruise Control itself

7.7.1. Configuring Cruise Control

Enable Cruise Control metrics using the cruiseControl.metricsConfig property in the Kafka resource to provide a reference to a ConfigMap that contains JMX exporter configuration for the metrics to expose.

For example:

apiVersion: kafka.strimzi.io/v1beta2
kind: Kafka
metadata:
  name: my-cluster
spec:
  # ...
  kafka:
    # ...
  zookeeper:
    # ...
  cruiseControl:
    metricsConfig:
       type: jmxPrometheusExporter
       valueFrom:
         configMapKeyRef:
           name: my-config-map
           key: my-key
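
The referenced ConfigMap holds the JMX Exporter configuration under the named key. A minimal sketch is shown below; the rule pattern is illustrative, so use kafka-cruise-control-metrics.yaml as the starting point for the actual configuration.

kind: ConfigMap
apiVersion: v1
metadata:
  name: my-config-map
data:
  my-key: |
    lowercaseOutputName: true
    rules:
    # Illustrative pattern mapping Cruise Control MBeans to Prometheus metrics
    - pattern: kafka.cruisecontrol<name=(.+)><>(\w+)
      name: kafka_cruisecontrol_$1_$2
      type: GAUGE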

7.7.2. Enabling the Cruise Control Grafana dashboard

If you deployed Cruise Control with your Kafka cluster with the metrics enabled, you can enable Grafana to present the metrics data it exposes.

A Cruise Control dashboard is provided in the examples/metrics directory as a JSON file:

  • strimzi-cruise-control.json

When metrics data has been collected for some time, the Cruise Control charts are populated.

Cruise Control

Shows metrics for:

  • Number of snapshot windows that are monitored by Cruise Control
  • Number of time windows considered valid because they contain enough samples to compute an optimization proposal
  • Number of ongoing executions running for proposals or rebalances
  • Current balancedness score of the Kafka cluster as calculated by the anomaly detector component of Cruise Control (every 5 minutes by default)
  • Percentage of monitored partitions
  • Number of goal violations reported by the anomaly detector (every 5 minutes by default)
  • How often a disk read failure happens on the brokers
  • Rate of metric sample fetch failures
  • Time needed to compute an optimization proposal
  • Time needed to create the cluster model
  • How often a proposal request or an actual rebalance request is made through the Cruise Control REST API
  • How often the overall cluster state and the user tasks state are requested through the Cruise Control REST API
  • JVM memory used
  • JVM garbage collection time
  • JVM garbage collection count