1.4. Customizing observability

Review the following sections to learn more about customizing, managing, and viewing data that is collected by the observability service.

To collect logs about new information that is created for observability resources, use the must-gather command. For more information, see the Must-gather section in the Troubleshooting documentation.

1.4.1. Creating custom rules

You can create custom rules for the observability installation by adding Prometheus recording rules and alerting rules to the observability resource. For more information, see Prometheus configuration.

Note: You can only create custom rules on the metrics that are collected from all managed clusters. To view a list of the metrics that are collected, run the following command: kubectl describe cm observability-metrics-whitelist.

Define custom rules with Prometheus to create alert conditions, and send notifications to an external messaging service. Complete the following steps to create a custom rule:

  1. Log in to your Red Hat Advanced Cluster Management hub cluster.
  2. Create a ConfigMap named thanos-ruler-custom-rules in the open-cluster-management-observability namespace. The key must be named custom_rules.yaml, as shown in the following example. You can create multiple rules in the configuration.

    By default, the out-of-the-box alert rules are defined in the ConfigMap in the open-cluster-management-observability namespace.

    For example, you can create a custom alert rule that notifies you when your CPU usage passes your defined value:

      custom_rules.yaml: |
        groups:
          - name: cluster-health
            rules:
            - alert: ClusterCPUHealth-jb
              annotations:
                summary: Notify when CPU utilization on a cluster is greater than the defined utilization limit
                description: "The cluster has a high CPU usage: {{ $value }} core for {{ $labels.cluster }} {{ $labels.clusterID }}."
              expr: |
                max(cluster:cpu_usage_cores:sum) by (clusterID, cluster, prometheus) > 0
              for: 5s
              labels:
                cluster: "{{ $labels.cluster }}"
                prometheus: "{{ $labels.prometheus }}"
                severity: critical

    Note: If this is the first new custom rule, it is created immediately. For changes to the ConfigMap, you must restart the observability pods with the following command: kubectl rollout restart statefulset observability-observatorium-thanos-rule -n open-cluster-management-observability.

  3. If you want to verify that the alert rule is functioning appropriately, complete the following steps:

    1. Access your Grafana dashboard and select the Explore icon.
    2. In the Metrics exploration bar, type in "ALERTS" and run the query. All the ALERTS that are currently in pending or firing state in the system are displayed.
    3. If your alert is not displayed, revisit the rule to see if the expression is accurate.
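
The following shell sketch wraps the example rule in a complete ConfigMap manifest and writes it to a local file, so that you can review it before you apply it. The file name custom-rules-configmap.yaml is a hypothetical choice, and the ConfigMap name and data key are assumptions based on step 2 and the example; adjust them if your installation requires different values.

```shell
# Write the ConfigMap manifest to a local file. The quoted 'EOF' keeps the
# shell from expanding the {{ $value }} and {{ $labels.* }} template fields.
cat > custom-rules-configmap.yaml <<'EOF'
apiVersion: v1
kind: ConfigMap
metadata:
  # Assumption: use the ConfigMap name that step 2 requires.
  name: thanos-ruler-custom-rules
  namespace: open-cluster-management-observability
data:
  # Assumption: the data key matches the key shown in the example rule.
  custom_rules.yaml: |
    groups:
      - name: cluster-health
        rules:
        - alert: ClusterCPUHealth-jb
          annotations:
            summary: Notify when CPU utilization on a cluster is greater than the defined utilization limit
            description: "The cluster has a high CPU usage: {{ $value }} core for {{ $labels.cluster }} {{ $labels.clusterID }}."
          expr: |
            max(cluster:cpu_usage_cores:sum) by (clusterID, cluster, prometheus) > 0
          for: 5s
          labels:
            cluster: "{{ $labels.cluster }}"
            prometheus: "{{ $labels.prometheus }}"
            severity: critical
EOF

# To create the ConfigMap on the hub cluster, you might then run:
#   oc apply -f custom-rules-configmap.yaml
```

To narrow the Grafana query to this rule, you can filter on the alert name, for example ALERTS{alertname="ClusterCPUHealth-jb"}.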

A custom rule is created.

Configuring rules for AlertManager

Integrate external messaging tools such as email, Slack, and PagerDuty to receive notifications from AlertManager. You must override the alertmanager-config secret in the open-cluster-management-observability namespace to add integrations, and configure routes for AlertManager. Complete the following steps to update the custom receiver rules:

  1. Extract the data from the alertmanager-config secret. Run the following command:

    oc -n open-cluster-management-observability get secret alertmanager-config --template='{{ index .data "alertmanager.yaml" }}' | base64 -d > alertmanager.yaml

  2. Edit and save the alertmanager.yaml file, and then replace the secret with the updated file by running the following command:

    oc -n open-cluster-management-observability create secret generic alertmanager-config --from-file=alertmanager.yaml --dry-run -o=yaml |  oc -n open-cluster-management-observability replace secret --filename=-

    Your updated secret might resemble the following content:

    global:
      smtp_smarthost: 'localhost:25'
      smtp_from: 'alertmanager@example.org'
      smtp_auth_username: 'alertmanager'
      smtp_auth_password: 'password'
    templates:
    - '/etc/alertmanager/template/*.tmpl'
    route:
      group_by: ['alertname', 'cluster', 'service']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 3h
      receiver: team-X-mails
      routes:
      - match_re:
          service: ^(foo1|foo2|baz)$
        receiver: team-X-mails
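
The content above refers to a team-X-mails receiver but does not define it. The following shell sketch writes a fuller configuration to a separate file so that the extracted alertmanager.yaml file is not overwritten. The file name alertmanager-example.yaml, the team-X-slack receiver, the severity route, the email address, and the Slack webhook URL are all hypothetical placeholders, not values from your cluster.

```shell
# Write an illustrative AlertManager configuration to a separate local file.
cat > alertmanager-example.yaml <<'EOF'
global:
  smtp_smarthost: 'localhost:25'
  smtp_from: 'alertmanager@example.org'
  smtp_auth_username: 'alertmanager'
  smtp_auth_password: 'password'
templates:
- '/etc/alertmanager/template/*.tmpl'
route:
  group_by: ['alertname', 'cluster', 'service']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 3h
  receiver: team-X-mails
  routes:
  - match_re:
      service: ^(foo1|foo2|baz)$
    receiver: team-X-mails
  # Hypothetical route: send critical alerts to a Slack channel.
  - match:
      severity: critical
    receiver: team-X-slack
receivers:
# Placeholder address: replace with a real mailbox.
- name: 'team-X-mails'
  email_configs:
  - to: 'team-X+alerts@example.org'
# Placeholder webhook URL and channel: replace with your own values.
- name: 'team-X-slack'
  slack_configs:
  - api_url: 'https://hooks.slack.com/services/REPLACE/WITH/WEBHOOK'
    channel: '#alerts'
EOF

# If the amtool CLI is installed, you might validate the file before use:
#   amtool check-config alertmanager-example.yaml
```

Every receiver name that a route references must appear in the receivers list; otherwise AlertManager rejects the configuration.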

Your changes are applied immediately after the secret is modified. For an example AlertManager configuration, see prometheus/alertmanager.