Chapter 4. Advanced features

The following optional features can provide additional functionality to the Service Telemetry Framework (STF):

4.1. Customizing the deployment

The Service Telemetry Operator watches for a ServiceTelemetry manifest to load into Red Hat OpenShift Container Platform (OCP). The Operator then creates other objects in memory, which results in the dependent Operators creating the workloads they are responsible for managing.

Warning

When you override the manifest, you must provide the entire manifest contents, including object names or namespaces. There is no dynamic parameter substitution when you override a manifest.

Use manifest overrides only as a last resort.

To override a manifest successfully with Service Telemetry Framework (STF), deploy a default environment using the core options only. For more information about the core options, see Section 2.3.11, “Creating a ServiceTelemetry object in OCP”. When you deploy STF, use the oc get command to retrieve the default deployed manifest. When you use a manifest that was originally generated by Service Telemetry Operator, the manifest is compatible with the other objects that are managed by the Operators.

For example, when the backends.metrics.prometheus.enabled: true parameter is configured in the ServiceTelemetry manifest, the Service Telemetry Operator requests components for metrics retrieval and storage using the default manifests. In some cases, you might want to override the default manifest. For more information, see Section 4.1.1, “Manifest override parameters”.
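
For example, to capture the default Prometheus manifest as a starting point for the prometheusManifest override parameter, save the deployed object to a file. This is a sketch; before you reuse the contents, remove server-generated fields, such as status, that you do not want to manage:

    $ oc get prometheus default -oyaml > default-prometheus.yaml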

4.1.1. Manifest override parameters

This table describes the available parameters that you can use to override a manifest, along with the corresponding retrieval commands.

Table 4.1. Manifest override parameters

alertmanagerManifest
    Description: Override the Alertmanager object creation. The Prometheus Operator watches for Alertmanager objects.
    Retrieval command: oc get alertmanager default -oyaml

alertmanagerConfigManifest
    Description: Override the Secret that contains the Alertmanager configuration. The Prometheus Operator uses a secret named alertmanager-{{ alertmanager-name }}, for example, default, to provide the alertmanager.yaml configuration to Alertmanager.
    Retrieval command: oc get secret alertmanager-default -oyaml

elasticsearchManifest
    Description: Override the ElasticSearch object creation. The Elastic Cloud on Kubernetes Operator watches for ElasticSearch objects.
    Retrieval command: oc get elasticsearch elasticsearch -oyaml

interconnectManifest
    Description: Override the Interconnect object creation. The AMQ Interconnect Operator watches for Interconnect objects.
    Retrieval command: oc get interconnect default-interconnect -oyaml

prometheusManifest
    Description: Override the Prometheus object creation. The Prometheus Operator watches for Prometheus objects.
    Retrieval command: oc get prometheus default -oyaml

servicemonitorManifest
    Description: Override the ServiceMonitor object creation. The Prometheus Operator watches for ServiceMonitor objects.
    Retrieval command: oc get servicemonitor default -oyaml

4.1.2. Overriding a managed manifest

Edit the ServiceTelemetry object and provide a parameter and manifest. For a list of available manifest override parameters, see Section 4.1, “Customizing the deployment”. The default ServiceTelemetry object is default. Use oc get servicetelemetry to list the available STF deployments.

Tip

The oc edit command loads the default system editor. To override the default editor, set the EDITOR environment variable to your preferred editor. For example, EDITOR=nano oc edit servicetelemetry default.

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Load the ServiceTelemetry object into an editor:

    $ oc edit servicetelemetry default
  4. To modify the ServiceTelemetry object, provide a manifest override parameter and the contents of the manifest to write to OCP instead of the defaults provided by STF.

    Note

    The trailing pipe (|) after the manifest override parameter indicates that the value is multi-line.

    $ oc edit stf default
    
    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      ...
    spec:
      alertmanagerConfigManifest: | 1
        apiVersion: v1
        kind: Secret
        metadata:
          name: 'alertmanager-default'
          namespace: 'service-telemetry'
        type: Opaque
        stringData:
          alertmanager.yaml: |-
            global:
              resolve_timeout: 10m
            route:
              group_by: ['job']
              group_wait: 30s
              group_interval: 5m
              repeat_interval: 12h
              receiver: 'null'
            receivers:
            - name: 'null' 2
    status:
      ...
    1. The manifest override parameter is defined in the spec of the ServiceTelemetry object.
    2. End of the manifest override content.
  5. Save and close.
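
To verify that the override was applied, retrieve the object with the corresponding retrieval command from Table 4.1, “Manifest override parameters”, for example:

    $ oc get secret alertmanager-default -oyaml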

4.2. Alerts

You create alert rules in Prometheus and alert routes in Alertmanager. Alert rules in Prometheus servers send alerts to an Alertmanager, which manages the alerts. Alertmanager can silence, inhibit, or aggregate alerts, and send notifications using email, on-call notification systems, or chat platforms.

To create an alert, complete the following tasks:

  1. Create an alert rule in Prometheus. For more information, see Section 4.2.1, “Creating an alert rule in Prometheus”.
  2. Create an alert route in Alertmanager. For more information, see Section 4.2.3, “Creating an alert route in Alertmanager”.

Additional resources

For more information about alerts or notifications with Prometheus and Alertmanager, see https://prometheus.io/docs/alerting/overview/

To view an example set of alerts that you can use with Service Telemetry Framework (STF), see https://github.com/infrawatch/service-telemetry-operator/tree/master/deploy/alerts

4.2.1. Creating an alert rule in Prometheus

Prometheus evaluates alert rules to trigger notifications. If the rule condition returns an empty result set, the condition is false. Otherwise, the rule is true and it triggers an alert. For example, the expression collectd_qpid_router_status < 1 returns no results while every router reports a status of at least 1, so the alert stays inactive; when a router reports a status of 0, the expression returns that series and the alert fires.

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Create a PrometheusRule object that contains the alert rule. The Prometheus Operator loads the rule into Prometheus:

    $ oc apply -f - <<EOF
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      creationTimestamp: null
      labels:
        prometheus: default
        role: alert-rules
      name: prometheus-alarm-rules
      namespace: service-telemetry
    spec:
      groups:
        - name: ./openstack.rules
          rules:
            - alert: Metric Listener down
              expr: collectd_qpid_router_status < 1 # To change the rule, edit the value of the expr parameter.
    EOF
  4. To verify that the rules have been loaded into Prometheus by the Operator, create a pod with access to curl:

    $ oc run curl --generator=run-pod/v1 --image=radial/busyboxplus:curl -i --tty
  5. Run curl against the prometheus-operated service to return the rules loaded into memory; a related query for active alerts follows this procedure:

    [ root@curl:/ ]$ curl prometheus-operated:9090/api/v1/rules
    {"status":"success","data":{"groups":[{"name":"./openstack.rules","file":"/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml","rules":[{"name":"Metric Listener down","query":"collectd_qpid_router_status \u003c 1","duration":0,"labels":{},"annotations":{},"alerts":[],"health":"ok","type":"alerting"}],"interval":30}]}}
  6. Verify that the output shows the rules loaded into the PrometheusRule object, for example, the defined ./openstack.rules group, and then exit from the pod:

    [ root@curl:/ ]$ exit
  7. Clean up the environment by deleting the curl pod:

    $ oc delete pod curl
    
    pod "curl" deleted

4.2.2. Configuring custom alerts

You can add custom alerts to the PrometheusRule object that you created in Section 4.2.1, “Creating an alert rule in Prometheus”.

Procedure

  1. Use the oc edit command:

    $ oc edit prometheusrules prometheus-alarm-rules
  2. Edit the PrometheusRules manifest, for example, to add a rule like the one in the sketch after this procedure.
  3. Save and close.
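
For example, you might append a second rule to the ./openstack.rules group. The following sketch uses hypothetical values; the expr, for, labels, and annotations fields are placeholders that you adapt to your environment:

    spec:
      groups:
        - name: ./openstack.rules
          rules:
            - alert: Metric Listener down
              expr: collectd_qpid_router_status < 1
            - alert: Collectd uptime stalled   # hypothetical custom alert
              expr: rate(collectd_uptime[10m]) == 0
              for: 5m
              labels:
                severity: warning
              annotations:
                summary: collectd uptime metric has stopped increasing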

4.2.3. Creating an alert route in Alertmanager

Use Alertmanager to deliver alerts to an external system, such as email, IRC, or another notification channel. The Prometheus Operator manages the Alertmanager configuration as a Red Hat OpenShift Container Platform (OCP) secret. By default, STF deploys a basic configuration that results in no receivers:

alertmanager.yaml: |-
  global:
    resolve_timeout: 5m
  route:
    group_by: ['job']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'null'
  receivers:
  - name: 'null'

To deploy a custom Alertmanager route with STF, pass an alertmanagerConfigManifest parameter to the Service Telemetry Operator, which results in an updated secret that the Prometheus Operator manages. For more information, see Section 4.1.2, “Overriding a managed manifest”.

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Edit the ServiceTelemetry object for your STF deployment:

    $ oc edit stf default
  4. Add a new parameter, alertmanagerConfigManifest, and the Secret object contents to define the alertmanager.yaml configuration for Alertmanager:

    Note

    This step loads the default template that is already managed by the Service Telemetry Operator. To verify that the changes populate correctly, change a value, retrieve the alertmanager-default secret, and verify that the new value is loaded into memory. For example, change the value of global.resolve_timeout from 5m to 10m.

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      name: default
      namespace: service-telemetry
    spec:
      backends:
        metrics:
          prometheus:
            enabled: true
      alertmanagerConfigManifest: |
        apiVersion: v1
        kind: Secret
        metadata:
          name: 'alertmanager-default'
          namespace: 'service-telemetry'
        type: Opaque
        stringData:
          alertmanager.yaml: |-
            global:
              resolve_timeout: 10m
            route:
              group_by: ['job']
              group_wait: 30s
              group_interval: 5m
              repeat_interval: 12h
              receiver: 'null'
            receivers:
            - name: 'null'
  5. Verify that the configuration was applied to the secret:

    $ oc get secret alertmanager-default -o go-template='{{index .data "alertmanager.yaml" | base64decode }}'
    
    global:
      resolve_timeout: 10m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
    receivers:
    - name: 'null'
  6. To verify that the configuration has been loaded into Alertmanager, create a pod with access to curl:

    $ oc run curl --generator=run-pod/v1 --image=radial/busyboxplus:curl -i --tty
  7. Run curl against the alertmanager-operated service to retrieve the status and configYAML contents, and verify that the supplied configuration matches the configuration loaded into Alertmanager:

    [ root@curl:/ ]$ curl alertmanager-operated:9093/api/v1/status
    
    {"status":"success","data":{"configYAML":"global:\n  resolve_timeout: 10m\n  http_config: {}\n  smtp_hello: localhost\n  smtp_require_tls: true\n  pagerduty_url: https://events.pagerduty.com/v2/enqueue\n  hipchat_api_url: https://api.hipchat.com/\n  opsgenie_api_url: https://api.opsgenie.com/\n  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/\n  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/\nroute:\n  receiver: \"null\"\n  group_by:\n  - job\n  group_wait: 30s\n  group_interval: 5m\n  repeat_interval: 12h\nreceivers:\n- name: \"null\"\ntemplates: []\n",...}}
  8. Verify that the configYAML field contains the expected changes. Exit from the pod:

    [ root@curl:/ ]$ exit
  9. To clean up the environment, delete the curl pod:

    $ oc delete pod curl
    
    pod "curl" deleted

Additional resources

  • For more information about the Red Hat OpenShift Container Platform secret and the Prometheus Operator, see Alerting.

4.3. Configuring SNMP Traps

You can integrate Service Telemetry Framework (STF) with an existing infrastructure monitoring platform that receives notifications via SNMP traps. To enable SNMP traps, modify the ServiceTelemetry object and configure the snmpTraps parameters.

For more information about configuring alerts, see Section 4.2, “Alerts”.

Prerequisites

  • Know the IP address or hostname of the SNMP trap receiver where you want to send the alerts.

Procedure

  1. To enable SNMP traps, modify the ServiceTelemetry object:

    $ oc edit stf default
  2. Set the alerting.alertmanager.receivers.snmpTraps parameters:

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    ...
    spec:
      ...
      alerting:
        alertmanager:
          receivers:
            snmpTraps:
              enabled: true
              target: 10.10.10.10
  3. Ensure that you set the value of target to the IP address or hostname of the SNMP trap receiver.

4.4. High availability

High availability is the ability of Service Telemetry Framework (STF) to rapidly recover from failures in its component services. Although Red Hat OpenShift Container Platform (OCP) restarts a failed pod if nodes are available to schedule the workload, this recovery process might take more than one minute, during which time events and metrics are lost. A high availability configuration includes multiple copies of STF components, reducing recovery time to approximately 2 seconds. To protect against failure of an OCP node, deploy STF to an OCP cluster with three or more nodes.

Note

STF is not yet a fully fault tolerant system. Delivery of metrics and events during the recovery period is not guaranteed.

Enabling high availability has the following effects:

  • Three ElasticSearch pods run instead of the default one.
  • The following components run two pods instead of the default one:

    • AMQ Interconnect
    • Alertmanager
    • Prometheus
    • Events Smart Gateway
    • Collectd Metrics Smart Gateway
  • Recovery time from a lost pod in any of these services reduces to approximately 2 seconds.
Note

The Ceilometer Metrics Smart Gateway is not yet highly available.

4.4.1. Configuring high availability

To configure STF for high availability, add highAvailability.enabled: true to the ServiceTelemetry object in OCP. You can set this parameter at installation time or, if you already deployed STF, complete the following steps:

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Use the oc command to edit the ServiceTelemetry object:

    $ oc edit stf default
  4. Add highAvailability.enabled: true to the spec section:

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    ...
    spec:
      ...
      highAvailability:
        enabled: true
  5. Save your changes and close the object.
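
To verify that the additional replicas are running after the Operators reconcile the change, list the pods and confirm that each component listed in Section 4.4, “High availability” runs the expected number of pods:

    $ oc get pods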

4.5. Dashboards

Use the third-party application Grafana to visualize the system-level metrics that collectd gathers for each individual host node. For more information about configuring collectd, see Section 3.3, “Configuring Red Hat OpenStack Platform overcloud for Service Telemetry Framework”.

4.5.1. Setting up Grafana to host the dashboard

Grafana is not included in the default Service Telemetry Framework (STF) deployment, so you must deploy the Grafana Operator from OperatorHub.io. When you use the Service Telemetry Operator to deploy Grafana, it creates a Grafana instance and configures the default data sources for the local STF deployment.

Prerequisites

Enable the OperatorHub.io catalog source for the Grafana Operator. For more information, see Section 2.3.5, “Enabling the OperatorHub.io Community Catalog Source”.

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Deploy the Grafana Operator:

    $ oc apply -f - <<EOF
    apiVersion: operators.coreos.com/v1alpha1
    kind: Subscription
    metadata:
      name: grafana-operator
      namespace: service-telemetry
    spec:
      channel: alpha
      installPlanApproval: Automatic
      name: grafana-operator
      source: operatorhubio-operators
      sourceNamespace: openshift-marketplace
    EOF
  4. To verify that the operator launched successfully, run the oc get csv command. If the value of the PHASE column is Succeeded, the operator launched successfully:

    $ oc get csv
    NAME                                DISPLAY                                         VERSION   REPLACES                            PHASE
    grafana-operator.v3.2.0             Grafana Operator                                3.2.0                                         Succeeded
    ...
  5. To launch a Grafana instance, create or modify the ServiceTelemetry object and set graphing.enabled to true:

    $ oc edit stf default
    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    ...
    spec:
      ...
      graphing:
        enabled: true
  6. Verify that the Grafana instance deployed:

    $ oc get pod -l app=grafana
    NAME                                  READY   STATUS    RESTARTS   AGE
    grafana-deployment-7fc7848b56-sbkhv   1/1     Running   0          1m

4.5.2. Importing dashboards

The Grafana Operator can import and manage dashboards by creating GrafanaDashboard objects. You can view example dashboards at https://github.com/infrawatch/dashboards.
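
A GrafanaDashboard object wraps a dashboard JSON definition that the Grafana Operator loads into Grafana. The following is a minimal sketch: the apiVersion is an assumption based on the integreatly.org resource group that appears in the procedure output, the app: grafana label is an assumption that must match the dashboard label selector of your Grafana instance, and spec.json holds a dashboard definition exported from Grafana:

    apiVersion: integreatly.org/v1alpha1
    kind: GrafanaDashboard
    metadata:
      name: example-dashboard
      namespace: service-telemetry
      labels:
        app: grafana
    spec:
      json: |
        {
          "title": "Example dashboard",
          "panels": []
        }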

Procedure

  1. Import a dashboard:

    $ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/rhos-dashboard.yaml
    grafanadashboard.integreatly.org/rhos-dashboard created
  2. Verify that the resources installed correctly:

    $ oc get grafanadashboards
    NAME             AGE
    rhos-dashboard   7d21h
    $ oc get grafanadatasources
    NAME                    AGE
    default-ds-prometheus   20h
  3. Expose the grafana service as a route:

    $ oc create route edge dashboards --service=grafana-service --insecure-policy="Redirect" --port=3000
  4. Retrieve the Grafana route address:

    $ oc get route dashboards
    NAME         HOST/PORT                                                                    PATH   SERVICES          PORT   TERMINATION     WILDCARD
    dashboards   dashboards-service-telemetry.apps.stfcloudops1.lab.upshift.rdu2.redhat.com          grafana-service   3000   edge/Redirect   None

    The HOST/PORT value is the Grafana route address.

  5. Navigate to https://<GRAFANA-ROUTE-ADDRESS> in a web browser. Replace <GRAFANA-ROUTE-ADDRESS> with the HOST/PORT value that you retrieved in the previous step.
  6. To view the dashboard, click Dashboards and Manage.

4.5.3. Viewing and editing queries

Procedure

  1. Log in to Red Hat OpenShift Container Platform. To view and edit queries, log in as the admin user.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. To retrieve the default username and password, describe the Grafana object using the oc describe command:

    $ oc describe grafana default
    Tip

    To set the admin username and password through the ServiceTelemetry object, use the graphing.grafana.adminUser and graphing.grafana.adminPassword parameters.
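
    A minimal sketch of those parameters in the ServiceTelemetry object; the credential values are placeholders:

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    ...
    spec:
      ...
      graphing:
        enabled: true
        grafana:
          adminUser: admin
          adminPassword: <password>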

4.5.4. The Grafana infrastructure dashboard

The infrastructure dashboard shows metrics for a single node at a time. Select a node from the upper left corner of the dashboard.

4.5.4.1. Top panels

Title                 | Unit        | Description
Current Global Alerts | -           | Current alerts fired by Prometheus
Recent Global Alerts  | -           | Recently fired alerts in 5m time steps
Status Panel          | -           | Node status: up, down, unavailable
Uptime                | s/m/h/d/M/Y | Total operational time of node
CPU Cores             | cores       | Total number of cores
Memory                | bytes       | Total memory
Disk Size             | bytes       | Total storage size
Processes             | processes   | Total number of processes listed by type
Load Average          | processes   | Average number of running and uninterruptible processes residing in the kernel execution queue

4.5.4.2. Networking panels

Panels that display the network interfaces of the node.

Panel                                   | Unit     | Description
Physical Interfaces Ingress Errors      | errors   | Total errors with incoming data
Physical Interfaces Egress Errors       | errors   | Total errors with outgoing data
Physical Interfaces Ingress Error Rates | errors/s | Rate of incoming data errors
Physical Interfaces Egress Error Rates  | errors/s | Rate of outgoing data errors
Physical Interfaces Packets Ingress     | pps      | Incoming packets per second
Physical Interfaces Packets Egress      | pps      | Outgoing packets per second
Physical Interfaces Data Ingress        | bytes/s  | Incoming data rates
Physical Interfaces Data Egress         | bytes/s  | Outgoing data rates
Physical Interfaces Drop Rate Ingress   | pps      | Incoming packets drop rate
Physical Interfaces Drop Rate Egress    | pps      | Outgoing packets drop rate

4.5.4.3. CPU panels

Panels that display CPU usage of the node.

Panel                   | Unit    | Description
Current CPU Usage       | percent | Instantaneous usage at the time of the last query
Aggregate CPU Usage     | percent | Average non-idle CPU activity of all cores on a node
Aggr. CPU Usage by Type | percent | Time spent for each type of thread, averaged across all cores

4.5.4.4. Memory panels

Panels that display memory usage on the node.

Panel           | Unit      | Description
Memory Used     | percent   | Amount of memory being used at time of last query
Huge Pages Used | hugepages | Number of hugepages being used
Memory          |           |

4.5.4.5. Disk/file system

Panels that display space used on disk.

Disk Space Usage
    Unit: percent
    Description: Total disk use at time of last query.

Inode Usage
    Unit: percent
    Description: Total inode use at time of last query.

Aggregate Disk Space Usage
    Unit: bytes
    Description: Total disk space used and reserved.
    Notes: Because this query relies on the df plugin, temporary file systems that do not necessarily use disk space are included in the results. The query tries to filter out most of these, but it might not be exhaustive.

Disk Traffic
    Unit: bytes/s
    Description: Shows rates for both reading and writing.

Disk Load
    Unit: percent
    Description: Approximate percentage of total disk bandwidth being used. The weighted I/O time series includes the backlog that might be accumulating.
    Notes: For more information, see the collectd disk plugin docs.

Operations/s
    Unit: ops/s
    Description: Operations done per second.

Average I/O Operation Time
    Unit: seconds
    Description: Average time each I/O operation took to complete. This average is not accurate; see the collectd disk plugin docs.

4.6. Multiple cloud configuration

You can configure multiple Red Hat OpenStack Platform clouds to target a single instance of Service Telemetry Framework (STF):

  1. Plan the AMQP address prefixes that you want to use for each cloud. For more information, see Section 4.6.1, “Planning AMQP address prefixes”.
  2. Deploy metrics and events consumer Smart Gateways for each cloud to listen on the corresponding address prefixes. For more information, see Section 4.6.2, “Deploying Smart Gateways”.
  3. Configure each cloud to send its metrics and events to STF on the correct address. For more information, see Section 4.6.4, “Creating the OpenStack environment file”.

Figure 4.1. Two Red Hat OpenStack Platform clouds connect to STF

An example of two Red Hat OpenStack Platform clouds connecting to STF

4.6.1. Planning AMQP address prefixes

By default, Red Hat OpenStack Platform nodes gather data through two data collectors: collectd and Ceilometer. These components send telemetry data or notifications to their respective AMQP addresses, for example, collectd/telemetry, and STF Smart Gateways listen on those addresses for monitoring data.

To support multiple clouds and to identify which cloud generated the monitoring data, configure each cloud to send data to a unique address. Prefix a cloud identifier to the second part of the address. The following list shows some example addresses and identifiers:

  • collectd/cloud1-telemetry
  • collectd/cloud1-notify
  • anycast/ceilometer/cloud1-metering.sample
  • anycast/ceilometer/cloud1-event.sample
  • collectd/cloud2-telemetry
  • collectd/cloud2-notify
  • anycast/ceilometer/cloud2-metering.sample
  • anycast/ceilometer/cloud2-event.sample
  • collectd/us-east-1-telemetry
  • collectd/us-west-3-telemetry

4.6.2. Deploying Smart Gateways

You must deploy a Smart Gateway for each of the data collection types for each cloud: one for collectd metrics, one for collectd events, one for Ceilometer metrics, and one for Ceilometer events. Configure each of the Smart Gateways to listen on the AMQP address that you define for the corresponding cloud. You define Smart Gateways with the clouds parameter in the ServiceTelemetry manifest.

When you deploy STF for the first time, Smart Gateway manifests are created that define the initial Smart Gateways for a single cloud. To support multiple clouds, you deploy a set of Smart Gateways for each data collection type to handle the metrics and the events for each cloud. The initial Smart Gateways are defined under cloud1 with the following subscription addresses:

Collector  | Type    | Default subscription address
collectd   | metrics | collectd/telemetry
collectd   | events  | collectd/notify
Ceilometer | metrics | anycast/ceilometer/metering.sample
Ceilometer | events  | anycast/ceilometer/event.sample

Prerequisites

You have determined your naming scheme and have created your list of clouds objects. For more information about determining your naming scheme, see Section 4.6.1, “Planning AMQP address prefixes”. For more information about creating the content for the clouds parameter, see Section 2.3.10.2, “clouds”.

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Edit the default ServiceTelemetry object and add a clouds parameter with your configuration:

    $ oc edit stf default
    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      ...
    spec:
      ...
      clouds:
      - name: cloud1
        events:
          collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/cloud1-notify
          - collectorType: ceilometer
            subscriptionAddress: anycast/ceilometer/cloud1-event.sample
        metrics:
          collectors:
          - collectorType: collectd
            subscriptionAddress: collectd/cloud1-telemetry
          - collectorType: ceilometer
            subscriptionAddress: anycast/ceilometer/cloud1-metering.sample
      - name: cloud2
        events:
          ...
  4. Save the ServiceTelemetry object.
  5. Verify that each Smart Gateway is running. This can take several minutes depending on the number of Smart Gateways:

    $ oc get po -l app=smart-gateway
    NAME                                                      READY   STATUS    RESTARTS   AGE
    default-cloud1-ceil-event-smartgateway-6cfb65478c-g5q82   1/1     Running   0          13h
    default-cloud1-ceil-meter-smartgateway-58f885c76d-xmxwn   1/1     Running   0          13h
    default-cloud1-coll-event-smartgateway-58fbbd4485-rl9bd   1/1     Running   0          13h
    default-cloud1-coll-meter-smartgateway-7c6fc495c4-jn728   2/2     Running   0          13h

4.6.3. Deleting the default Smart Gateways

After you configure STF for multiple clouds, you can delete the default Smart Gateways if they are no longer in use. The Service Telemetry Operator can remove SmartGateway objects that were created but are no longer listed in the clouds list of the ServiceTelemetry object. To enable the removal of SmartGateway objects that are not defined by the clouds parameter, set cloudsRemoveOnMissing: true in the ServiceTelemetry manifest.

Tip

If you do not want any Smart Gateways deployed, define an empty clouds object using the clouds: {} parameter.

Warning

The cloudsRemoveOnMissing parameter is disabled by default. If you enable it, the Service Telemetry Operator deletes any SmartGateway objects in the current namespace that are not defined by the clouds parameter, including manually created objects, with no possibility of restoring them.

Procedure

  1. Define your clouds parameter with the list of cloud objects to be managed by the Service Telemetry Operator. For more information, see Section 2.3.10.2, “clouds”.
  2. Edit the ServiceTelemetry object and add the cloudsRemoveOnMissing parameter:

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      ...
    spec:
      ...
      cloudsRemoveOnMissing: true
      clouds:
        ...
  3. Save the modifications.
  4. Verify that the Operator deleted the Smart Gateways. This can take several minutes while the Operators reconcile the changes:

    $ oc get smartgateways

4.6.4. Creating the OpenStack environment file

To label traffic according to the cloud of origin, you must create a configuration with cloud-specific instance names. Create an stf-connectors.yaml file and adjust the values of CeilometerQdrEventsConfig, CeilometerQdrMetricsConfig, and CollectdAmqpInstances to match the AMQP address prefix scheme. For more information, see Section 4.6.1, “Planning AMQP address prefixes”.

Warning

Remove the enable-stf.yaml and ceilometer-write-qdr.yaml environment file references from your overcloud deployment command. This configuration is redundant and results in duplicate information being sent from each cloud node.

Procedure

  1. Create the stf-connectors.yaml file and modify it to match the AMQP address that you want for this cloud deployment:

    stf-connectors.yaml

    resource_registry:
        OS::TripleO::Services::Collectd: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml
        OS::TripleO::Services::MetricsQdr: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/qdr-container-puppet.yaml
        OS::TripleO::Services::CeilometerAgentCentral: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-central-container-puppet.yaml
        OS::TripleO::Services::CeilometerAgentNotification: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-notification-container-puppet.yaml
        OS::TripleO::Services::CeilometerAgentIpmi: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-ipmi-container-puppet.yaml
        OS::TripleO::Services::ComputeCeilometerAgent: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-compute-container-puppet.yaml
        OS::TripleO::Services::Redis: /usr/share/openstack-tripleo-heat-templates/deployment/database/redis-pacemaker-puppet.yaml
    
    parameter_defaults:
        EnableSTF: true
    
        EventPipelinePublishers: []
        MetricPipelinePublishers: []
        CeilometerEnablePanko: false
        CeilometerQdrPublishEvents: true
        CeilometerQdrEventsConfig:
            driver: amqp
            topic: cloud1-event   1
        CeilometerQdrMetricsConfig:
            driver: amqp
            topic: cloud1-metering   2
    
    
        CollectdConnectionType: amqp1
        CollectdAmqpInterval: 5
        CollectdDefaultPollingInterval: 5
    
        CollectdAmqpInstances:
            cloud1-notify:        3
                notify: true
                format: JSON
                presettle: false
            cloud1-telemetry:     4
                format: JSON
                presettle: true
    
        MetricsQdrAddresses:
            - prefix: collectd
              distribution: multicast
            - prefix: anycast/ceilometer
              distribution: multicast
    
        MetricsQdrSSLProfiles:
            - name: sslProfile
    
        MetricsQdrConnectors:
            - host: stf-default-interconnect-5671-service-telemetry.apps.infra.watch   5
              port: 443
              role: edge
              verifyHostname: false
              sslProfile: sslProfile

    1. Define the topic for Ceilometer events. This value is the address format of anycast/ceilometer/cloud1-event.sample.
    2. Define the topic for Ceilometer metrics. This value is the address format of anycast/ceilometer/cloud1-metering.sample.
    3. Define the topic for collectd events. This value is the format of collectd/cloud1-notify.
    4. Define the topic for collectd metrics. This value is the format of collectd/cloud1-telemetry.
    5. Adjust the MetricsQdrConnectors host to the address of the STF route.
  2. Ensure that the naming convention in the stf-connectors.yaml file aligns with the spec.amqpUrl field in the Smart Gateway configuration. For example, configure the CeilometerQdrEventsConfig.topic field to a value of cloud1-event.
  3. Save the file in a directory for custom environment files, for example /home/stack/custom_templates/.
  4. Source the authentication file:

    [stack@undercloud-0 ~]$ source stackrc
    
    (undercloud) [stack@undercloud-0 ~]$
  5. Include the stf-connectors.yaml file in the overcloud deployment command, along with any other environment files relevant to your environment:

    (undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy \
    --templates /usr/share/openstack-tripleo-heat-templates \
    ...
    -e /home/stack/custom_templates/stf-connectors.yaml \
    ...

Additional resources

For information about validating the deployment, see Section 3.3.3, “Validating client-side installation”.

4.6.5. Querying metrics data from multiple clouds

Data stored in Prometheus has a service label attached according to the Smart Gateway it was scraped from. You can use this label to query data from a specific cloud.

To query data from a specific cloud, use a PromQL query that matches the associated service label, for example: collectd_uptime{service="default-cloud1-coll-meter-smartgateway"}.
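
Because the service label for collectd metrics Smart Gateways follows the pattern default-<cloud name>-coll-meter-smartgateway, you can also match several clouds at once with a regular-expression matcher. The following sketch assumes clouds named cloud1 and cloud2:

    collectd_uptime{service=~"default-cloud(1|2)-coll-meter-smartgateway"}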

4.7. Ephemeral storage

Use ephemeral storage to run Service Telemetry Framework (STF) without persistently storing data in your Red Hat OpenShift Container Platform (OCP) cluster. Ephemeral storage is not recommended in a production environment because the data is volatile even when the platform operates correctly and as designed. For example, restarting a pod or rescheduling the workload to another node results in the loss of any local data written since the pod started.

4.7.1. Configuring ephemeral storage

To configure STF components for ephemeral storage, add ...storage.strategy: ephemeral to the corresponding parameter. For example, to enable ephemeral storage for the Prometheus backend, set backends.metrics.prometheus.storage.strategy: ephemeral. Components that support configuration of ephemeral storage include alerting.alertmanager, backends.metrics.prometheus, and backends.events.elasticsearch. You can add ephemeral storage configuration at installation time or, if you already deployed STF, complete the following steps:

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    $ oc project service-telemetry
  3. Edit the ServiceTelemetry object:

    $ oc edit stf default
  4. Add the ...storage.strategy: ephemeral parameter to the spec section of the relevant component:

    apiVersion: infra.watch/v1beta1
    kind: ServiceTelemetry
    metadata:
      name: stf-default
      namespace: service-telemetry
    spec:
      alerting:
        enabled: true
        alertmanager:
          storage:
            strategy: ephemeral
      backends:
        metrics:
          prometheus:
            enabled: true
            storage:
              strategy: ephemeral
        events:
          elasticsearch:
            enabled: true
            storage:
              strategy: ephemeral
  5. Save your changes and close the object.

4.8. Monitoring the resource usage of Red Hat OpenStack Platform services

Monitor the resource usage of the Red Hat OpenStack Platform services, such as the APIs and other infrastructure processes, to identify bottlenecks in the overcloud and to show which services are running out of compute power. Enable the collectd-libpod-stats plug-in to gather CPU and memory usage metrics for every container running in the overcloud.

Prerequisites

  • You have created the stf-connectors.yaml file. For more information, see Section 4.6.4, “Creating the OpenStack environment file”.

Procedure

  1. Open the stf-connectors.yaml file.
  2. Add the following configuration to parameter_defaults:

      CollectdEnableLibpodstats: true
  3. Continue with the overcloud deployment procedure.
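
In context, the relevant portion of the stf-connectors.yaml file from Section 4.6.4, “Creating the OpenStack environment file” might look like the following sketch, with the other parameters omitted:

    parameter_defaults:
        EnableSTF: true
        CollectdEnableLibpodstats: true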