Chapter 5. Using operational features of Service Telemetry Framework

You can use the following operational features to provide additional functionality to the Service Telemetry Framework (STF):

5.1. Dashboards in Service Telemetry Framework

Use the third-party application, Grafana, to visualize system-level metrics that collectd and Ceilometer gather for each individual host node.

For more information about configuring collectd, see Section 4.1, “Deploying Red Hat OpenStack Platform overcloud for Service Telemetry Framework”.

You can use two dashboards to monitor a cloud:

Infrastructure dashboard
Use the infrastructure dashboard to view metrics for a single node at a time. Select a node from the upper left corner of the dashboard.
Cloud view dashboard
Use the cloud view dashboard to view panels to monitor service resource usage, API stats, and cloud events. You must enable API health monitoring and service monitoring to provide the data for this dashboard. API health monitoring is enabled by default in the STF base configuration. For more information, see Section 4.1.2, “Creating the base configuration for STF”.

5.1.1. Configuring Grafana to host the dashboard

Grafana is not included in the default Service Telemetry Framework (STF) deployment, so you must deploy the Grafana Operator from OperatorHub.io. When you use the Service Telemetry Operator to deploy Grafana, it results in a Grafana instance and the configuration of the default data sources for the local STF deployment.


  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Deploy the Grafana operator:

    $ oc apply -f - <<EOF
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: grafana-operator
      namespace: service-telemetry
    spec:
      channel: alpha
      installPlanApproval: Automatic
      name: grafana-operator
      source: operatorhubio-operators
      sourceNamespace: openshift-marketplace
    EOF
  4. Verify that the Operator launched successfully. In the command output, if the value of the PHASE column is Succeeded, the Operator launched successfully:

    $ oc get csv --selector
    NAME                       DISPLAY            VERSION   REPLACES                   PHASE
    grafana-operator.v3.10.3   Grafana Operator   3.10.3    grafana-operator.v3.10.2   Succeeded
  5. To launch a Grafana instance, create or modify the ServiceTelemetry object. Set graphing.enabled and graphing.grafana.ingressEnabled to true:

    $ oc edit stf default
    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    ...
    spec:
      ...
      graphing:
        enabled: true
        grafana:
          ingressEnabled: true
  6. Verify that the Grafana instance deployed:

    $ oc get pod -l app=grafana
    NAME                                  READY   STATUS    RESTARTS   AGE
    grafana-deployment-7fc7848b56-sbkhv   1/1     Running   0          1m
  7. Verify that the Grafana data sources installed correctly:

    $ oc get grafanadatasources
    NAME                    AGE
    default-datasources     20h
  8. Verify that the Grafana route exists:

    $ oc get route grafana-route
    NAME            HOST/PORT                                          PATH   SERVICES          PORT   TERMINATION   WILDCARD
    grafana-route          grafana-service   3000   edge          None

5.1.2. Overriding the default Grafana container image

The dashboards in Service Telemetry Framework (STF) require features that are available only in Grafana version 8.1.0 and later. By default, the Service Telemetry Operator installs a compatible version. You can override the base Grafana image by specifying the image path to an image registry with graphing.grafana.baseImage.


  1. Ensure that you have the correct version of Grafana:

    $ oc get pod -l "app=grafana" -ojsonpath='{.items[0].spec.containers[0].image}'
  2. If the running image is older than 8.1.0, patch the ServiceTelemetry object to update the image. Service Telemetry Operator updates the Grafana manifest, which restarts the Grafana deployment:

    $ oc patch stf/default --type merge -p '{"spec":{"graphing":{"grafana":{"baseImage":""}}}}'
  3. Verify that a new Grafana pod exists and has a STATUS value of Running:

    $ oc get pod -l "app=grafana"
    NAME                                 READY     STATUS    RESTARTS   AGE
    grafana-deployment-fb9799b58-j2hj2   1/1       Running   0          10s
  4. Verify that the new instance is running the updated image:

    $ oc get pod -l "app=grafana" -ojsonpath='{.items[0].spec.containers[0].image}'

5.1.3. Importing dashboards

The Grafana Operator can import and manage dashboards by creating GrafanaDashboard objects. You can view example dashboards at


  1. Import the infrastructure dashboard:

    $ oc apply -f created
  2. Import the cloud dashboard:


    For some panels in the cloud dashboard, you must set the value of the collectd virt plugin parameter hostname_format to name uuid hostname in the stf-connectors.yaml file. If you do not configure this parameter, affected dashboards remain empty. For more information about the virt plugin, see collectd plugins.

    $ oc apply -f created
  3. Import the cloud events dashboard:

    $ oc apply -f created
  4. Import the virtual machine dashboard:

    $ oc apply -f configured
  5. Import the memcached dashboard:

    $ oc apply -f created
  6. Verify that the dashboards are available:

    $ oc get grafanadashboards
    NAME                         AGE
    memcached-dashboard-1.3      115s
    rhos-cloud-dashboard-1.3     2m12s
    rhos-cloudevents-dashboard   2m6s
    rhos-dashboard-1.3           2m17s
    virtual-machine-view-1.3     2m
  7. Retrieve the Grafana route address:

    $ oc get route grafana-route -ojsonpath='{}'
  8. In a web browser, navigate to https://<grafana_route_address>. Replace <grafana_route_address> with the value that you retrieved in the previous step.
  9. To view the dashboard, click Dashboards and Manage.

5.1.4. Retrieving and setting Grafana login credentials

Service Telemetry Framework (STF) sets default login credentials when Grafana is enabled. You can override the credentials in the ServiceTelemetry object.


  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Retrieve the default username and password from the STF object:

    $ oc get stf default -o jsonpath="{.spec.graphing.grafana['adminUser','adminPassword']}"
  4. To modify the default values of the Grafana administrator username and password through the ServiceTelemetry object, use the graphing.grafana.adminUser and graphing.grafana.adminPassword parameters.
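As a non-interactive way to set these two parameters, a merge patch in the same style as the oc patch command shown in Section 5.1.2 can be used. The following is a minimal sketch, not the only supported method. It assumes the ServiceTelemetry object is named default, and the user name and password are hypothetical placeholders. The check for the oc client only keeps the snippet harmless on a machine where oc is not installed.

```shell
# Hypothetical credentials for illustration only; substitute your own values.
# Values that contain double quotes or backslashes would need JSON escaping.
admin_user='stf-admin'
admin_password='change-me'

# Build a merge patch for graphing.grafana.adminUser and adminPassword.
patch=$(printf '{"spec":{"graphing":{"grafana":{"adminUser":"%s","adminPassword":"%s"}}}}' \
  "$admin_user" "$admin_password")
echo "$patch"

# Apply the patch only if the oc client is available.
if command -v oc >/dev/null 2>&1; then
  oc patch stf/default --type merge -p "$patch"
fi
```

After the patch is applied, the jsonpath query from step 3 should return the new values.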

5.2. Metrics retention time period in Service Telemetry Framework

The default retention time for metrics stored in Service Telemetry Framework (STF) is 24 hours, which provides enough data for trends to develop for the purposes of alerting.

For long-term storage, use systems designed for long-term data retention, for example, Thanos.

5.2.1. Editing the metrics retention time period in Service Telemetry Framework

You can adjust Service Telemetry Framework (STF) for additional metrics retention time.


  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Edit the ServiceTelemetry object:

    $ oc edit stf default
  4. Add retention: 7d to the storage section of backends.metrics.prometheus to increase the retention period to seven days:


    If you set a long retention period, retrieving data from heavily populated Prometheus systems can result in queries returning results slowly.

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      name: stf-default
      namespace: service-telemetry
    spec:
      ...
      backends:
        metrics:
          prometheus:
            enabled: true
            storage:
              strategy: persistent
              retention: 7d
      ...
  5. Save your changes and close the object.
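The same change can be sketched as a merge patch followed by a read-back of the value. This is an illustration under assumptions: the ServiceTelemetry object is named default, and the retention setting lives at spec.backends.metrics.prometheus.storage.retention. The check for the oc client only keeps the snippet harmless where oc is not installed.

```shell
# Merge patch that sets the Prometheus retention period to seven days.
patch='{"spec":{"backends":{"metrics":{"prometheus":{"storage":{"retention":"7d"}}}}}}'
echo "$patch"

if command -v oc >/dev/null 2>&1; then
  # Apply the patch, then read the value back to confirm that it was stored.
  oc patch stf/default --type merge -p "$patch"
  oc get stf default -o jsonpath='{.spec.backends.metrics.prometheus.storage.retention}'
fi
```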

5.3. Alerts in Service Telemetry Framework

You create alert rules in Prometheus and alert routes in Alertmanager. Alert rules in Prometheus servers send alerts to an Alertmanager, which manages the alerts. Alertmanager can silence, inhibit, or aggregate alerts, and send notifications by using email, on-call notification systems, or chat platforms.

To create an alert, complete the following tasks:

  1. Create an alert rule in Prometheus. For more information, see Section 5.3.1, “Creating an alert rule in Prometheus”.
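  2. Create an alert route in Alertmanager. There are two ways in which you can create an alert route: a standard alert route (see Section 5.3.3, “Creating a standard alert route in Alertmanager”) or an alert route with templating (see Section 5.3.4, “Creating an alert route with templating in Alertmanager”).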

For more information about alerts or notifications with Prometheus and Alertmanager, see

To view an example set of alerts that you can use with Service Telemetry Framework (STF), see

5.3.1. Creating an alert rule in Prometheus

Prometheus evaluates alert rules to trigger notifications. If the rule condition returns an empty result set, the condition is false. Otherwise, the rule is true and it triggers an alert.


  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Create a PrometheusRule object that contains the alert rule. The Prometheus Operator loads the rule into Prometheus:

    $ oc apply -f - <<EOF
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      creationTimestamp: null
      labels:
        prometheus: default
        role: alert-rules
      name: prometheus-alarm-rules
      namespace: service-telemetry
    spec:
      groups:
        - name: ./openstack.rules
          rules:
            - alert: Collectd metrics receive rate is zero
              expr: rate(sg_total_collectd_msg_received_count[1m]) == 0
    EOF

    To change the rule, edit the value of the expr parameter.
  4. To verify that the Operator loaded the rules into Prometheus, run the curl command against the default-prometheus-proxy route with basic authentication:

    $ curl -k --user "internal:$(oc get secret default-prometheus-htpasswd -ogo-template='{{ .data.password | base64decode }}')" https://$(oc get route default-prometheus-proxy -ogo-template='{{ }}')/api/v1/rules
    {"status":"success","data":{"groups":[{"name":"./openstack.rules","file":"/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml","rules":[{"state":"inactive","name":"Collectd metrics receive count is zero","query":"rate(sg_total_collectd_msg_received_count[1m]) == 0","duration":0,"labels":{},"annotations":{},"alerts":[],"health":"ok","evaluationTime":0.00034627,"lastEvaluation":"2021-12-07T17:23:22.160448028Z","type":"alerting"}],"interval":30,"evaluationTime":0.000353787,"lastEvaluation":"2021-12-07T17:23:22.160444017Z"}]}}
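The response is a single line of JSON, which is hard to scan by eye. As a rough sketch, the group name, rule name, and rule health can be pulled out with grep. The response variable below is abbreviated from the sample output above. A JSON-aware tool is more robust than pattern matching if one is available.

```shell
# Sample response, abbreviated from the output shown above.
response='{"status":"success","data":{"groups":[{"name":"./openstack.rules","rules":[{"state":"inactive","name":"Collectd metrics receive count is zero","health":"ok","type":"alerting"}]}]}}'

# Print every "name" and "health" field; grep -o emits one match per line.
printf '%s\n' "$response" | grep -oE '"(name|health)":"[^"]*"'
```

In practice, pipe the output of the curl command from step 4 into the same grep expression instead of using a saved string.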

5.3.2. Configuring custom alerts

You can add custom alerts to the PrometheusRule object that you created in Section 5.3.1, “Creating an alert rule in Prometheus”.


  1. Use the oc edit command:

    $ oc edit prometheusrules prometheus-alarm-rules
  2. Edit the PrometheusRule manifest.
  3. Save and close the manifest.
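As an illustration of what a custom alert might look like, the following sketch writes one additional rule entry to a scratch file. The entry can then be pasted into the rules list of the ./openstack.rules group while editing the manifest. The expression reuses the metric from Section 5.3.1. The alert name, the for duration, the severity label, and the summary text are hypothetical. The for, labels, and annotations fields are standard Prometheus alerting rule fields.

```shell
# Write a hypothetical rule entry to a scratch file for pasting into the manifest.
cat > custom-rule.yaml <<'EOF'
- alert: Collectd metrics receive rate is zero for 5 minutes
  expr: rate(sg_total_collectd_msg_received_count[1m]) == 0
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: No collectd metrics received for 5 minutes
EOF

# Show the entry so that it can be copied into the editor.
cat custom-rule.yaml
```

With for: 5m, the alert fires only after the expression has been true continuously for five minutes, which avoids notifications for brief interruptions.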

5.3.3. Creating a standard alert route in Alertmanager

Use Alertmanager to deliver alerts to an external system, such as email, IRC, or another notification channel. The Prometheus Operator manages the Alertmanager configuration as a Red Hat OpenShift Container Platform secret. By default, Service Telemetry Framework (STF) deploys a basic configuration that results in no receivers:

alertmanager.yaml: |-
  global:
    resolve_timeout: 5m
  route:
    group_by: ['job']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'null'
  receivers:
  - name: 'null'

To deploy a custom Alertmanager route with STF, you must pass an alertmanagerConfigManifest parameter to the Service Telemetry Operator that results in an updated secret, managed by the Prometheus Operator.


If your alertmanagerConfigManifest contains a custom template to construct the title and text of the sent alert, deploy the contents of the alertmanagerConfigManifest by using a base64-encoded configuration. For more information, see Section 5.3.4, “Creating an alert route with templating in Alertmanager”.


  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Edit the ServiceTelemetry object for your STF deployment:

    $ oc edit stf default
  4. Add the new parameter alertmanagerConfigManifest and the Secret object contents to define the alertmanager.yaml configuration for Alertmanager:


    This step loads the default template that the Service Telemetry Operator manages. To verify that the changes are populating correctly, change a value, return the alertmanager-default secret, and verify that the new value is loaded into memory. For example, change the value of the parameter global.resolve_timeout from 5m to 10m.

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      name: default
      namespace: service-telemetry
    spec:
      backends:
        metrics:
          prometheus:
            enabled: true
      alertmanagerConfigManifest: |
        apiVersion: v1
        kind: Secret
        metadata:
          name: 'alertmanager-default'
          namespace: 'service-telemetry'
        type: Opaque
        stringData:
          alertmanager.yaml: |-
            global:
              resolve_timeout: 10m
            route:
              group_by: ['job']
              group_wait: 30s
              group_interval: 5m
              repeat_interval: 12h
              receiver: 'null'
            receivers:
            - name: 'null'
  5. Verify that the configuration has been applied to the secret:

    $ oc get secret alertmanager-default -o go-template='{{index .data "alertmanager.yaml" | base64decode }}'
    global:
      resolve_timeout: 10m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
    receivers:
    - name: 'null'
  6. Run the curl command against the alertmanager-proxy service to retrieve the status and configYAML contents, and verify that the supplied configuration matches the configuration in Alertmanager:

    $ oc run curl -it --serviceaccount=prometheus-k8s --restart='Never' --image=radial/busyboxplus:curl -- sh -c "curl -k -H \"Content-Type: application/json\" -H \"Authorization: Bearer \$(cat /var/run/secrets/\" https://default-alertmanager-proxy:9095/api/v1/status"
  7. Verify that the configYAML field contains the changes you expect.
  8. To clean up the environment, delete the curl pod:

    $ oc delete pod curl
    pod "curl" deleted

5.3.4. Creating an alert route with templating in Alertmanager

Use Alertmanager to deliver alerts to an external system, such as email, IRC, or another notification channel. The Prometheus Operator manages the Alertmanager configuration as a Red Hat OpenShift Container Platform secret. By default, Service Telemetry Framework (STF) deploys a basic configuration that results in no receivers:

alertmanager.yaml: |-
  global:
    resolve_timeout: 5m
  route:
    group_by: ['job']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'null'
  receivers:
  - name: 'null'

If the alertmanagerConfigManifest parameter contains a custom template, for example, to construct the title and text of the sent alert, deploy the contents of the alertmanagerConfigManifest by using a base64-encoded configuration.


  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Edit the ServiceTelemetry object for your STF deployment:

    $ oc edit stf default
  4. To deploy a custom Alertmanager route with STF, you must pass an alertmanagerConfigManifest parameter to the Service Telemetry Operator that results in an updated secret that is managed by the Prometheus Operator:

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      name: default
      namespace: service-telemetry
    spec:
      backends:
        metrics:
          prometheus:
            enabled: true
      alertmanagerConfigManifest: |
        apiVersion: v1
        kind: Secret
        metadata:
          name: 'alertmanager-default'
          namespace: 'service-telemetry'
        type: Opaque
        data:
          alertmanager.yaml: Z2xvYmFsOgogIHJlc29sdmVfdGltZW91dDogMTBtCiAgc2xhY2tfYXBpX3VybDogPHNsYWNrX2FwaV91cmw+CnJlY2VpdmVyczoKICAtIG5hbWU6IHNsYWNrCiAgICBzbGFja19jb25maWdzOgogICAgLSBjaGFubmVsOiAjc3RmLWFsZXJ0cwogICAgICB0aXRsZTogfC0KICAgICAgICAuLi4KICAgICAgdGV4dDogPi0KICAgICAgICAuLi4Kcm91dGU6CiAgZ3JvdXBfYnk6IFsnam9iJ10KICBncm91cF93YWl0OiAzMHMKICBncm91cF9pbnRlcnZhbDogNW0KICByZXBlYXRfaW50ZXJ2YWw6IDEyaAogIHJlY2VpdmVyOiAnc2xhY2snCg==
  5. Verify that the configuration has been applied to the secret:

    $ oc get secret alertmanager-default -o go-template='{{index .data "alertmanager.yaml" | base64decode }}'
    global:
      resolve_timeout: 10m
      slack_api_url: <slack_api_url>
    receivers:
      - name: slack
        slack_configs:
        - channel: #stf-alerts
          title: |-
            ...
          text: >-
            ...
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'slack'
  6. Run the curl command against the alertmanager-proxy service to retrieve the status and configYAML contents, and verify that the supplied configuration matches the configuration in Alertmanager:

    $ oc run curl -it --serviceaccount=prometheus-k8s --restart='Never' --image=radial/busyboxplus:curl -- sh -c "curl -k -H \"Content-Type: application/json\" -H \"Authorization: Bearer \$(cat /var/run/secrets/\" https://default-alertmanager-proxy:9095/api/v1/status"
  7. Verify that the configYAML field contains the changes you expect.
  8. To clean up the environment, delete the curl pod:

    $ oc delete pod curl
    pod "curl" deleted
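The base64 string used for the alertmanager.yaml value in step 4 can be produced from a plain-text configuration file. The following is a minimal sketch that assumes a scratch file named alertmanager.yaml in the current directory. The configuration content is a shortened placeholder, not a complete templated route.

```shell
# Write a minimal Alertmanager configuration to a scratch file.
cat > alertmanager.yaml <<'EOF'
global:
  resolve_timeout: 10m
route:
  group_by: ['job']
  receiver: 'null'
receivers:
- name: 'null'
EOF

# Encode the file as a single base64 line (tr strips any line wrapping).
encoded=$(base64 < alertmanager.yaml | tr -d '\n')
echo "$encoded"

# Decode again and compare with the source file to confirm the round trip.
printf '%s' "$encoded" | base64 -d | diff - alertmanager.yaml && echo "round trip OK"
```

Checking the round trip before you paste the value into the ServiceTelemetry object catches truncated or wrapped strings early.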

5.4. Configuring SNMP traps

You can integrate Service Telemetry Framework (STF) with an existing infrastructure monitoring platform that receives notifications through SNMP traps. To enable SNMP traps, modify the ServiceTelemetry object and configure the snmpTraps parameters.

For more information about configuring alerts, see Section 5.3, “Alerts in Service Telemetry Framework”.


  • You know the IP address or hostname of the SNMP trap receiver where you want to send the alerts.


  1. To enable SNMP traps, modify the ServiceTelemetry object:

    $ oc edit stf default
  2. Set the alerting.alertmanager.receivers.snmpTraps parameters:

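    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    ...
    spec:
      ...
      alerting:
        alertmanager:
          receivers:
            snmpTraps:
              enabled: true
              target: <IP_address_or_hostname>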
  3. Ensure that you set the value of target to the IP address or hostname of the SNMP trap receiver.
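The two settings can also be applied in one step with a merge patch. This is a sketch under assumptions: the ServiceTelemetry object is named default, and 192.0.2.10 is a documentation-only address that stands in for your SNMP trap receiver. The check for the oc client only keeps the snippet harmless where oc is not installed.

```shell
# Hypothetical SNMP trap receiver; replace with your own IP address or hostname.
target='192.0.2.10'

# Merge patch that enables snmpTraps and sets the target in one operation.
patch=$(printf '{"spec":{"alerting":{"alertmanager":{"receivers":{"snmpTraps":{"enabled":true,"target":"%s"}}}}}}' \
  "$target")
echo "$patch"

# Apply the patch only if the oc client is available.
if command -v oc >/dev/null 2>&1; then
  oc patch stf/default --type merge -p "$patch"
fi
```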

5.5. High availability

With high availability, Service Telemetry Framework (STF) can rapidly recover from failures in its component services. Although Red Hat OpenShift Container Platform restarts a failed pod if nodes are available to schedule the workload, this recovery process might take more than one minute, during which time events and metrics are lost. A high availability configuration includes multiple copies of STF components, which reduces recovery time to approximately 2 seconds. To protect against failure of a Red Hat OpenShift Container Platform node, deploy STF to a Red Hat OpenShift Container Platform cluster with three or more nodes.


STF is not yet a fully fault tolerant system. Delivery of metrics and events during the recovery period is not guaranteed.

Enabling high availability has the following effects:

  • Three Elasticsearch pods run instead of the default one.
  • The following components run two pods instead of the default one:

    • AMQ Interconnect
    • Alertmanager
    • Prometheus
    • Events Smart Gateway
    • Metrics Smart Gateway
  • Recovery time from a lost pod in any of these services reduces to approximately 2 seconds.

5.5.1. Configuring high availability

To configure Service Telemetry Framework (STF) for high availability, add highAvailability.enabled: true to the ServiceTelemetry object in Red Hat OpenShift Container Platform. You can set this parameter at installation time or, if you already deployed STF, complete the following steps:


  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Use the oc command to edit the ServiceTelemetry object:

    $ oc edit stf default
  4. Add highAvailability.enabled: true to the spec section:

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    ...
    spec:
      ...
      highAvailability:
        enabled: true
  5. Save your changes and close the object.

5.6. Ephemeral storage

You can use ephemeral storage to run Service Telemetry Framework (STF) without persistently storing data in your Red Hat OpenShift Container Platform cluster.


If you use ephemeral storage, you might experience data loss if a pod is restarted, updated, or rescheduled onto another node. Use ephemeral storage only for development or testing, and not production environments.

5.6.1. Configuring ephemeral storage

To configure STF components for ephemeral storage, add storage.strategy: ephemeral to the corresponding parameter. For example, to enable ephemeral storage for the Prometheus back end, set backends.metrics.prometheus.storage.strategy to ephemeral. Components that support configuration of ephemeral storage include alerting.alertmanager, backends.metrics.prometheus, and backends.events.elasticsearch. You can add ephemeral storage configuration at installation time or, if you already deployed STF, complete the following steps:


  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Edit the ServiceTelemetry object:

    $ oc edit stf default
  4. Add the ephemeral parameter to the spec section of the relevant component:

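    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      name: stf-default
      namespace: service-telemetry
    spec:
      alerting:
        enabled: true
        alertmanager:
          storage:
            strategy: ephemeral
      backends:
        metrics:
          prometheus:
            enabled: true
            storage:
              strategy: ephemeral
        events:
          elasticsearch:
            enabled: true
            storage:
              strategy: ephemeral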
  5. Save your changes and close the object.

5.7. Observability Strategy in Service Telemetry Framework

Service Telemetry Framework (STF) does not include storage backends and alerting tools. STF uses community operators to deploy Prometheus, Alertmanager, Grafana, and Elasticsearch. STF makes requests to these community operators to create instances of each application configured to work with STF.

Instead of having Service Telemetry Operator create custom resource requests, you can use your own deployments of these applications or other compatible applications, and scrape the metrics Smart Gateways for delivery to your own Prometheus-compatible system for telemetry storage. If you set the observability strategy to use alternative backends instead, persistent or ephemeral storage is not required for STF.

5.7.1. Configuring an alternate observability strategy

To configure STF to skip the deployment of storage, visualization, and alerting backends, add observabilityStrategy: none to the ServiceTelemetry spec. In this mode, only AMQ Interconnect routers and metrics Smart Gateways are deployed, and you must configure an external Prometheus-compatible system to collect metrics from the STF Smart Gateways.


Currently, only metrics are supported when you set observabilityStrategy to none. Events Smart Gateways are not deployed.


  1. Create a ServiceTelemetry object with the property observabilityStrategy: none in the spec parameter. The following manifest results in a default deployment of STF that is suitable for receiving telemetry from a single cloud with all metrics collector types:

    $ oc apply -f - <<EOF
    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      name: default
      namespace: service-telemetry
    spec:
      observabilityStrategy: none
    EOF
  2. To verify that all workloads are operating correctly, view the pods and the status of each pod:

    $ oc get pods
    NAME                                                      READY   STATUS    RESTARTS   AGE
    default-cloud1-ceil-meter-smartgateway-59c845d65b-gzhcs   3/3     Running   0          132m
    default-cloud1-coll-meter-smartgateway-75bbd948b9-d5phm   3/3     Running   0          132m
    default-cloud1-sens-meter-smartgateway-7fdbb57b6d-dh2g9   3/3     Running   0          132m
    default-interconnect-668d5bbcd6-57b2l                     1/1     Running   0          132m
    interconnect-operator-b8f5bb647-tlp5t                     1/1     Running   0          47h
    service-telemetry-operator-566b9dd695-wkvjq               1/1     Running   0          156m
    smart-gateway-operator-58d77dcf7-6xsq7                    1/1     Running   0          47h

For more information about configuring additional clouds or to change the set of supported collectors, see Section 4.4.2, “Deploying Smart Gateways”.

5.8. Resource usage of Red Hat OpenStack Platform services

You can monitor the resource usage of the Red Hat OpenStack Platform (RHOSP) services, such as the APIs and other infrastructure processes, to identify bottlenecks in the overcloud by showing services that run out of compute power. Resource usage monitoring is enabled by default.

5.8.1. Disabling resource usage monitoring of Red Hat OpenStack Platform services

To disable the monitoring of RHOSP containerized service resource usage, you must set the CollectdEnableLibpodstats parameter to false.



  1. Open the stf-connectors.yaml file and add the CollectdEnableLibpodstats parameter to override the setting in enable-stf.yaml. Ensure that stf-connectors.yaml is called from the openstack overcloud deploy command after enable-stf.yaml:

    parameter_defaults:
      CollectdEnableLibpodstats: false
  2. Continue with the overcloud deployment procedure. For more information, see Section 4.1.4, “Deploying the overcloud”.

5.9. Red Hat OpenStack Platform API status and containerized services health

You can use the OCI (Open Container Initiative) standard to assess the container health status of each Red Hat OpenStack Platform (RHOSP) service by periodically running a health check script. Most RHOSP services implement a health check that logs issues and returns a binary status. For the RHOSP APIs, the health checks query the root endpoint and determine the health based on the response time.

Monitoring of RHOSP container health and API status is enabled by default.

5.9.1. Disabling container health and API status monitoring

To disable RHOSP containerized service health and API status monitoring, you must set the CollectdEnableSensubility parameter to false.



  1. Open the stf-connectors.yaml file and add the CollectdEnableSensubility parameter to override the setting in enable-stf.yaml. Ensure that stf-connectors.yaml is called from the openstack overcloud deploy command after enable-stf.yaml:

    parameter_defaults:
      CollectdEnableSensubility: false
  2. Continue with the overcloud deployment procedure. For more information, see Section 4.1.4, “Deploying the overcloud”.
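The ordering requirement in step 1 also applies to the procedure in Section 5.8.1, and can be sketched as follows. The file locations and the <other_arguments> placeholder are hypothetical. The snippet only prints the command instead of running it, so that you can review the order of the -e options first.

```shell
# Hypothetical locations of the two environment files; adjust to your setup.
enable_stf=/home/stack/enable-stf.yaml
stf_connectors=/home/stack/stf-connectors.yaml

# Environment files passed later with -e override earlier ones, so
# stf-connectors.yaml must come after enable-stf.yaml.
deploy_cmd="openstack overcloud deploy <other_arguments> -e $enable_stf -e $stf_connectors"
echo "$deploy_cmd"
```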

