Chapter 4. Advanced features

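The following optional features can provide additional functionality to the Service Telemetry Framework (STF):

  • Customizing the deployment (Section 4.1)
  • Alerts (Section 4.2)
  • High availability (Section 4.3)
  • Dashboards (Section 4.4)
  • Configuring multiple clouds (Section 4.5)
  • Ephemeral storage (Section 4.6)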

4.1. Customizing the deployment

The Service Telemetry Operator watches for a ServiceTelemetry manifest to load into Red Hat OpenShift Container Platform (OCP). The Operator then creates other objects in memory, which results in the dependent Operators creating the workloads they are responsible for managing.

Warning

When you override the manifest, you must provide the entire manifest contents, including object names or namespaces. There is no dynamic parameter substitution when you override a manifest.

To override a manifest successfully with Service Telemetry Framework (STF), deploy a default environment using the core options only. For more information about the core options, see Section 2.3.10, “Creating a ServiceTelemetry object in OCP”. After you deploy STF, use the oc get command to retrieve the default deployed manifest. A manifest that was originally generated by Service Telemetry Operator is compatible with the other objects that the Operators manage.

For example, when the metricsEnabled: true parameter is configured in the ServiceTelemetry manifest, the Service Telemetry Operator requests components for metrics retrieval and storage using the default manifests. In some cases, you might want to override the default manifest. For more information, see Section 4.1.1, “Manifest override parameters”.

4.1.1. Manifest override parameters

This table describes the available parameters that you can use to override a manifest, along with the corresponding retrieval commands.

Table 4.1. Manifest override parameters

Override parameter | Description | Retrieval command

alertmanagerManifest | Override the Alertmanager object creation. The Prometheus Operator watches for Alertmanager objects. | oc get alertmanager stf-default -oyaml

alertmanagerConfigManifest | Override the Secret that contains the Alertmanager configuration. The Prometheus Operator uses a secret named alertmanager-{{ alertmanager-name }}, for example, alertmanager-stf-default, to provide the alertmanager.yaml configuration to Alertmanager. | oc get secret alertmanager-stf-default -oyaml

elasticsearchManifest | Override the ElasticSearch object creation. The Elastic Cloud on Kubernetes Operator watches for ElasticSearch objects. | oc get elasticsearch elasticsearch -oyaml

interconnectManifest | Override the Interconnect object creation. The AMQ Interconnect Operator watches for Interconnect objects. | oc get interconnect stf-default-interconnect -oyaml

prometheusManifest | Override the Prometheus object creation. The Prometheus Operator watches for Prometheus objects. | oc get prometheus stf-default -oyaml

servicemonitorManifest | Override the ServiceMonitor object creation. The Prometheus Operator watches for ServiceMonitor objects. | oc get servicemonitor stf-default -oyaml

smartgatewayCollectdMetricsManifest | Override the SmartGateway object creation for collectd metrics. The Smart Gateway Operator watches for SmartGateway objects. | oc get smartgateway stf-default-collectd-telemetry -oyaml

smartgatewayCollectdEventsManifest | Override the SmartGateway object creation for collectd events. The Smart Gateway Operator watches for SmartGateway objects. | oc get smartgateway stf-default-collectd-notification -oyaml

smartgatewayCeilometerEventsManifest | Override the SmartGateway object creation for Ceilometer events. The Smart Gateway Operator watches for SmartGateway objects. | oc get smartgateway stf-default-ceilometer-notification -oyaml
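
The mapping in Table 4.1 can be captured in a small shell helper. The following is a minimal sketch: the function name retrieval_command is hypothetical, and the object names assume the default stf-default deployment.

```shell
#!/bin/sh
# Print the command from Table 4.1 that retrieves the default manifest
# for a given override parameter. The function name is illustrative only.
retrieval_command() {
  case "$1" in
    alertmanagerManifest)                 echo "oc get alertmanager stf-default -oyaml" ;;
    alertmanagerConfigManifest)           echo "oc get secret alertmanager-stf-default -oyaml" ;;
    elasticsearchManifest)                echo "oc get elasticsearch elasticsearch -oyaml" ;;
    interconnectManifest)                 echo "oc get interconnect stf-default-interconnect -oyaml" ;;
    prometheusManifest)                   echo "oc get prometheus stf-default -oyaml" ;;
    servicemonitorManifest)               echo "oc get servicemonitor stf-default -oyaml" ;;
    smartgatewayCollectdMetricsManifest)  echo "oc get smartgateway stf-default-collectd-telemetry -oyaml" ;;
    smartgatewayCollectdEventsManifest)   echo "oc get smartgateway stf-default-collectd-notification -oyaml" ;;
    smartgatewayCeilometerEventsManifest) echo "oc get smartgateway stf-default-ceilometer-notification -oyaml" ;;
    *) echo "unknown override parameter: $1" >&2; return 1 ;;
  esac
}

# Show the retrieval command for one parameter.
retrieval_command prometheusManifest
```

The helper only prints the command, so you can review it before you run it against the cluster.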

4.1.2. Overriding a managed manifest

Edit the ServiceTelemetry object and provide a parameter and manifest. For a list of available manifest override parameters, see Section 4.1.1, “Manifest override parameters”. The default ServiceTelemetry object is stf-default. Use oc get servicetelemetry to list the available STF deployments.

Tip

The oc edit command loads the default system editor. To override the default editor, pass or set the environment variable EDITOR to the preferred editor. For example, EDITOR=nano oc edit servicetelemetry stf-default.

Procedure
  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    oc project service-telemetry
  3. Load the ServiceTelemetry object into an editor:

    oc edit servicetelemetry stf-default
  4. To modify the ServiceTelemetry object, provide a manifest override parameter and the contents of the manifest to write to OCP instead of the defaults provided by STF.

    Note

    The trailing pipe (|) after entering the manifest override parameter indicates that the value provided is multi-line.

    $ oc edit servicetelemetry stf-default
    
    apiVersion: infra.watch/v1alpha1
    kind: ServiceTelemetry
    metadata:
      annotations:
        kubectl.kubernetes.io/last-applied-configuration: |
          {"apiVersion":"infra.watch/v1alpha1","kind":"ServiceTelemetry","metadata":{"annotations":{},"name":"stf-default","namespace":"service-telemetry"},"spec":{"metricsEnabled":true}}
      creationTimestamp: "2020-04-14T20:29:42Z"
      generation: 1
      name: stf-default
      namespace: service-telemetry
      resourceVersion: "1949423"
      selfLink: /apis/infra.watch/v1alpha1/namespaces/service-telemetry/servicetelemetrys/stf-default
      uid: d058bc41-1bb0-49f5-9a8b-642f4b8adb95
    spec:
      metricsEnabled: true
      smartgatewayCollectdMetricsManifest: | 1
        apiVersion: smartgateway.infra.watch/v2alpha1
        kind: SmartGateway
        metadata:
          name: stf-default-collectd-telemetry
          namespace: service-telemetry
        spec:
          amqpUrl: stf-default-interconnect.service-telemetry.svc.cluster.local:5672/collectd/telemetry
          debug: true
          prefetch: 15000
          serviceType: metrics
          size: 1
          useTimestamp: true 2
    status:
      conditions:
      - ansibleResult:
          changed: 0
          completion: 2020-04-14T20:32:19.079508
          failures: 0
          ok: 52
          skipped: 1
        lastTransitionTime: "2020-04-14T20:29:59Z"
        message: Awaiting next reconciliation
        reason: Successful
        status: "True"
        type: Running
    1
    Manifest override parameter is defined in the spec of the ServiceTelemetry object.
    2
    End of the manifest override content.
  5. Save and close.
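
The layout of an override value, with the parameter name, the trailing pipe, and the indented manifest, can be generated from a manifest file. The following is a minimal sketch: render_override is a hypothetical helper, and the sample manifest is abbreviated for brevity, whereas a real override must contain the entire manifest.

```shell
#!/bin/sh
# Render a spec fragment for a manifest override: the parameter name,
# a trailing pipe to start a multi-line value, then the manifest
# indented beneath it. render_override is a hypothetical helper.
render_override() {
  param=$1
  manifest_file=$2
  printf '  %s: |\n' "$param"
  sed 's/^/    /' "$manifest_file"
}

# Abbreviated sample manifest; a real override needs the full contents.
manifest=$(mktemp)
cat > "$manifest" <<'EOF'
apiVersion: smartgateway.infra.watch/v2alpha1
kind: SmartGateway
metadata:
  name: stf-default-collectd-telemetry
  namespace: service-telemetry
spec:
  debug: true
EOF

render_override smartgatewayCollectdMetricsManifest "$manifest"
rm -f "$manifest"
```

You can paste the printed fragment under the spec section when you edit the ServiceTelemetry object.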

4.2. Alerts

You create alert rules in Prometheus and alert routes in Alertmanager. Alert rules in Prometheus servers send alerts to an Alertmanager, which manages the alerts. Alertmanager can silence, inhibit, or aggregate alerts, and send notifications using email, on-call notification systems, or chat platforms.

To create an alert, complete the following tasks:

  1. Create an alert rule in Prometheus. For more information, see Section 4.2.1, “Creating an alert rule in Prometheus”.
  2. Create an alert route in Alertmanager. For more information, see Section 4.2.3, “Creating an alert route in Alertmanager”.

Additional resources

For more information about alerts or notifications with Prometheus and Alertmanager, see https://prometheus.io/docs/alerting/overview/

To view an example set of alerts that you can use with Service Telemetry Framework (STF), see https://github.com/infrawatch/service-telemetry-operator/tree/master/deploy/alerts

4.2.1. Creating an alert rule in Prometheus

Prometheus evaluates alert rules to trigger notifications. If the rule condition returns an empty result set, the condition is false. Otherwise, the rule is true and it triggers an alert.
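
The rule semantics can be illustrated with the JSON that the Prometheus /api/v1/query endpoint returns for an expression. The following is a minimal sketch: rule_fires is a hypothetical helper that uses a plain substring match instead of a JSON parser, and the sample response is constructed for illustration.

```shell
#!/bin/sh
# Sketch of the rule semantics: an alert condition holds only when the
# query returns at least one sample. rule_fires is a hypothetical helper
# that inspects a JSON response from the /api/v1/query endpoint.
rule_fires() {
  case "$1" in
    *'"result":[]'*) return 1 ;;   # empty result set: condition is false
    *'"result":['*)  return 0 ;;   # at least one sample: condition is true
    *)               return 2 ;;   # not a recognizable query response
  esac
}

# Illustrative response for an expression that matches no samples.
empty='{"status":"success","data":{"resultType":"vector","result":[]}}'

if rule_fires "$empty"; then
  echo "alert"
else
  echo "no alert"
fi
```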

Procedure
  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    oc project service-telemetry
  3. Create a PrometheusRule object that contains the alert rule. The Prometheus Operator loads the rule into Prometheus:

    oc apply -f - <<EOF
    apiVersion: monitoring.coreos.com/v1
    kind: PrometheusRule
    metadata:
      creationTimestamp: null
      labels:
        prometheus: stf-default
        role: alert-rules
      name: prometheus-alarm-rules
      namespace: service-telemetry
    spec:
      groups:
        - name: ./openstack.rules
          rules:
            - alert: Metric Listener down
              expr: collectd_qpid_router_status < 1 # To change the rule, edit the value of the expr parameter.
    EOF
  4. To verify that the rules have been loaded into Prometheus by the Operator, create a pod with access to curl:

    oc run curl --generator=run-pod/v1 --image=radial/busyboxplus:curl -i --tty
  5. Run curl to access the prometheus-operated service to return the rules loaded into memory:

    [ root@curl:/ ]$ curl prometheus-operated:9090/api/v1/rules
    {"status":"success","data":{"groups":[{"name":"./openstack.rules","file":"/etc/prometheus/rules/prometheus-stf-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml","rules":[{"name":"Metric Listener down","query":"collectd_qpid_router_status \u003c 1","duration":0,"labels":{},"annotations":{},"alerts":[],"health":"ok","type":"alerting"}],"interval":30}]}}
  6. Verify that the output shows the rules loaded from the PrometheusRule object, for example, that the output contains the defined ./openstack.rules group, and then exit from the pod:

    [ root@curl:/ ]$ exit
  7. Clean up the environment by deleting the curl pod:

    $ oc delete pod curl
    
    pod "curl" deleted
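
The verification in this procedure can also be scripted. The following is a minimal sketch: rules_group_loaded is a hypothetical helper, and the sample response is abbreviated from the output shown in the procedure.

```shell
#!/bin/sh
# Check whether a name appears in the JSON returned by the Prometheus
# /api/v1/rules endpoint. rules_group_loaded is a hypothetical helper;
# it does a plain substring match, so it matches any "name" field,
# whether it belongs to a group or to a rule.
rules_group_loaded() {
  response=$1
  group=$2
  printf '%s' "$response" | grep -qF "\"name\":\"${group}\""
}

# Abbreviated sample response.
response='{"status":"success","data":{"groups":[{"name":"./openstack.rules","rules":[{"name":"Metric Listener down","type":"alerting"}],"interval":30}]}}'

if rules_group_loaded "$response" "./openstack.rules"; then
  echo "rule group loaded"
else
  echo "rule group missing"
fi
```

Inside the curl pod, you can capture the response with response=$(curl -s prometheus-operated:9090/api/v1/rules) before you call the helper.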
Additional resources

For more information on alerting, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md

4.2.2. Configuring custom alerts

You can add custom alerts to the PrometheusRule object that you created in Section 4.2.1, “Creating an alert rule in Prometheus”.

Procedure
  1. Use the oc edit command:

    oc edit prometheusrules prometheus-alarm-rules
  2. Edit the PrometheusRule manifest to add your custom alert rules.
  3. Save and close.
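
As an illustration of a customized rule, the following sketch writes an extended version of the PrometheusRule manifest to a temporary file. The for, labels, and annotations fields are standard Prometheus alerting rule fields, but the values shown here are assumptions chosen for illustration.

```shell
#!/bin/sh
# Write an extended PrometheusRule manifest to a temporary file.
# The for, labels, and annotations values are illustrative assumptions.
rule_file=$(mktemp)
cat > "$rule_file" <<'EOF'
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  labels:
    prometheus: stf-default
    role: alert-rules
  name: prometheus-alarm-rules
  namespace: service-telemetry
spec:
  groups:
    - name: ./openstack.rules
      rules:
        - alert: Metric Listener down
          expr: collectd_qpid_router_status < 1
          for: 5m                     # hypothetical hold time before firing
          labels:
            severity: critical        # hypothetical label for routing
          annotations:
            summary: The metrics listener reports a status below 1.
EOF

echo "Wrote ${rule_file}; apply it with: oc apply -f ${rule_file}"
```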
Additional resources

For more information about configuring alerting rules, see https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/.

For more information about PrometheusRules objects, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md

4.2.3. Creating an alert route in Alertmanager

Use Alertmanager to deliver alerts to an external system, such as email, IRC, or another notification channel. The Prometheus Operator manages the Alertmanager configuration as a Red Hat OpenShift Container Platform (OCP) secret. By default, STF deploys a basic configuration that results in no receivers:

alertmanager.yaml: |-
  global:
    resolve_timeout: 5m
  route:
    group_by: ['job']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'null'
  receivers:
  - name: 'null'

To deploy a custom Alertmanager route with STF, pass an alertmanagerConfigManifest parameter to the Service Telemetry Operator. The Operator then updates the secret, which the Prometheus Operator manages. For more information, see Section 4.1.2, “Overriding a managed manifest”.

Procedure
  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    oc project service-telemetry
  3. Edit the ServiceTelemetry object for your STF deployment:

    oc edit servicetelemetry stf-default
  4. Add a new parameter, alertmanagerConfigManifest, and the Secret object contents to define the alertmanager.yaml configuration for Alertmanager:

    Note

    This step loads the default template that is already managed by Service Telemetry Operator. To validate that the changes are populating correctly, change a value, retrieve the alertmanager-stf-default secret, and verify that the new value is loaded into memory. For example, change the value of global.resolve_timeout from 5m to 10m.

    apiVersion: infra.watch/v1alpha1
    kind: ServiceTelemetry
    metadata:
      name: stf-default
      namespace: service-telemetry
    spec:
      metricsEnabled: true
      alertmanagerConfigManifest: |
        apiVersion: v1
        kind: Secret
        metadata:
          name: 'alertmanager-stf-default'
          namespace: 'service-telemetry'
        type: Opaque
        stringData:
          alertmanager.yaml: |-
            global:
              resolve_timeout: 10m
            route:
              group_by: ['job']
              group_wait: 30s
              group_interval: 5m
              repeat_interval: 12h
              receiver: 'null'
            receivers:
            - name: 'null'
  5. Verify that the configuration was applied to the secret:

    $ oc get secret alertmanager-stf-default -o go-template='{{index .data "alertmanager.yaml" | base64decode }}'
    
    global:
      resolve_timeout: 10m
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'null'
    receivers:
    - name: 'null'
  6. To verify the configuration has been loaded into Alertmanager, create a pod with access to curl:

    oc run curl --generator=run-pod/v1 --image=radial/busyboxplus:curl -i --tty
  7. Run curl against the alertmanager-operated service to retrieve the status and configYAML contents, and verify that the supplied configuration matches the configuration loaded into Alertmanager:

    [ root@curl:/ ]$ curl alertmanager-operated:9093/api/v1/status
    
    {"status":"success","data":{"configYAML":"global:\n  resolve_timeout: 10m\n  http_config: {}\n  smtp_hello: localhost\n  smtp_require_tls: true\n  pagerduty_url: https://events.pagerduty.com/v2/enqueue\n  hipchat_api_url: https://api.hipchat.com/\n  opsgenie_api_url: https://api.opsgenie.com/\n  wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/\n  victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/\nroute:\n  receiver: \"null\"\n  group_by:\n  - job\n  group_wait: 30s\n  group_interval: 5m\n  repeat_interval: 12h\nreceivers:\n- name: \"null\"\ntemplates: []\n",...}}
  8. Verify that the configYAML field contains the expected changes. Exit from the pod:

    [ root@curl:/ ]$ exit
  9. To clean up the environment, delete the curl pod:

    $ oc delete pod curl
    
    pod "curl" deleted
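
The procedure changes only a timeout value. As an illustration of a route that delivers alerts to a receiver, the following sketch writes a Secret manifest with an email receiver to a temporary file. The smtp_smarthost, smtp_from, and email_configs settings are standard Alertmanager configuration fields, but the receiver name, host names, and addresses are hypothetical, and a real mail relay might require additional authentication settings.

```shell
#!/bin/sh
# Write a Secret manifest with an email receiver to a temporary file.
# The receiver name, mail relay, and addresses are hypothetical.
secret_file=$(mktemp)
cat > "$secret_file" <<'EOF'
apiVersion: v1
kind: Secret
metadata:
  name: 'alertmanager-stf-default'
  namespace: 'service-telemetry'
type: Opaque
stringData:
  alertmanager.yaml: |-
    global:
      resolve_timeout: 10m
      smtp_smarthost: 'smtp.example.com:587'   # hypothetical mail relay
      smtp_from: 'alertmanager@example.com'    # hypothetical sender
    route:
      group_by: ['job']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 12h
      receiver: 'ops-email'
    receivers:
    - name: 'ops-email'
      email_configs:
      - to: 'ops@example.com'                  # hypothetical recipient
EOF

echo "Wrote ${secret_file}"
```

The contents of the file are the value that you provide for the alertmanagerConfigManifest parameter.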
Additional resources

For more information about the Red Hat OpenShift Container Platform secret and the Prometheus operator, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md

4.3. High availability

High availability is the ability of Service Telemetry Framework (STF) to rapidly recover from failures in its component services. Although Red Hat OpenShift Container Platform (OCP) restarts a failed pod if nodes are available to schedule the workload, this recovery process might take more than one minute, during which time events and metrics are lost. A high availability configuration includes multiple copies of STF components, reducing recovery time to approximately 2 seconds. To protect against failure of an OCP node, deploy STF to an OCP cluster with three or more nodes.

Note

STF is not yet a fully fault tolerant system. Delivery of metrics and events during the recovery period is not guaranteed.

Enabling high availability has the following effects:

  • Two AMQ Interconnect pods run instead of the default 1.
  • Three ElasticSearch pods run instead of the default 1.
  • Recovery time from a lost pod in either of these services reduces to approximately 2 seconds.

4.3.1. Configuring high availability

To configure STF for high availability, add highAvailabilityEnabled: true to the ServiceTelemetry object in OCP. You can set this parameter at installation time or, if you already deployed STF, complete the following steps:

Procedure
  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    oc project service-telemetry
  3. Use the oc command to edit the ServiceTelemetry object:

    $ oc edit ServiceTelemetry
  4. Add highAvailabilityEnabled: true to the spec section:

    spec:
      eventsEnabled: true
      metricsEnabled: true
      highAvailabilityEnabled: true
  5. Save your changes and close the object.
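
As a non-interactive alternative to the editor, you can set the same field with a JSON merge patch. The following is a minimal sketch: spec_flag_patch is a hypothetical helper, and the oc patch command in the comment assumes the default stf-default object name.

```shell
#!/bin/sh
# Build a JSON merge patch that sets one boolean field in the spec of a
# ServiceTelemetry object. spec_flag_patch is a hypothetical helper.
spec_flag_patch() {
  printf '{"spec":{"%s":%s}}' "$1" "$2"
}

patch=$(spec_flag_patch highAvailabilityEnabled true)
echo "$patch"

# With a logged-in session, the patch could then be applied with:
#   oc patch servicetelemetry stf-default --type merge -p "$patch"
```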

4.4. Dashboards

Use the third-party application Grafana to visualize system-level metrics gathered by collectd for each individual host node. For more information about configuring collectd, see Section 3.3, “Configuring Red Hat OpenStack Platform overcloud for Service Telemetry Framework”.

4.4.1. Setting up Grafana to host the dashboard

Grafana is not included in the default Service Telemetry Framework (STF) deployment, so you must deploy the Grafana Operator from OperatorHub.io.

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    oc project service-telemetry
  3. Clone the dashboard repository.

    git clone https://github.com/infrawatch/dashboards
    cd dashboards
  4. Deploy the Grafana operator:

    oc create -f deploy/subscription.yaml
  5. To verify that the operator launched successfully, run the oc get csv command. If the value of the PHASE column is Succeeded, the operator launched successfully:

    $ oc get csv
    
    NAME                                DISPLAY                                         VERSION   REPLACES                            PHASE
    grafana-operator.v3.2.0             Grafana Operator                                3.2.0                                         Succeeded
    ...
  6. Launch a Grafana instance:

    $ oc create -f deploy/grafana.yaml
  7. Verify that the Grafana instance deployed:

    $ oc get pod -l app=grafana
    
    NAME                                  READY   STATUS    RESTARTS   AGE
    grafana-deployment-7fc7848b56-sbkhv   1/1     Running   0          1m
  8. Create the datasource and dashboard resources:

    oc create -f deploy/datasource.yaml \
        -f deploy/rhos-dashboard.yaml
  9. Verify that the resources installed correctly:

    $ oc get grafanadashboards
    
    NAME             AGE
    rhos-dashboard   7d21h
    
    $ oc get grafanadatasources
    NAME                                  AGE
    service-telemetry-grafanadatasource   1m
  10. Navigate to https://<grafana-route-address> in a web browser. Use the oc get routes command to retrieve the Grafana route address:

    oc get routes
  11. To view the dashboard, click Dashboards and Manage.
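
The PHASE check on the oc get csv output can be scripted. The following is a minimal sketch: grafana_operator_ready is a hypothetical helper that reads the table on standard input, and the sample text is abbreviated from the output shown in the procedure.

```shell
#!/bin/sh
# Succeed when the Grafana operator row in `oc get csv` output reports
# the Succeeded phase. grafana_operator_ready is a hypothetical helper
# that reads the table on standard input.
grafana_operator_ready() {
  awk '$1 ~ /^grafana-operator/ && $NF == "Succeeded" { found = 1 } END { exit !found }'
}

# Abbreviated sample output.
sample='NAME                      DISPLAY            VERSION   REPLACES   PHASE
grafana-operator.v3.2.0   Grafana Operator   3.2.0                Succeeded'

if printf '%s\n' "$sample" | grafana_operator_ready; then
  echo "operator ready"
else
  echo "operator not ready"
fi
```

Against a live cluster, pipe the command output into the helper: oc get csv | grafana_operator_ready.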

4.4.1.1. Viewing and editing queries

Procedure

  1. Log in to Red Hat OpenShift Container Platform. To view and edit queries, log in as the admin user.
  2. Change to the service-telemetry namespace:

    oc project service-telemetry
  3. To retrieve the default username and password, describe the Grafana object using oc describe:

    oc describe grafana service-telemetry-grafana

4.4.2. The Grafana infrastructure dashboard

The infrastructure dashboard shows metrics for a single node at a time. Select a node from the upper left corner of the dashboard.

4.4.2.1. Top panels

Title | Unit | Description

Current Global Alerts | - | Current alerts fired by Prometheus

Recent Global Alerts | - | Recently fired alerts in 5m time steps

Status Panel | - | Node status: up, down, unavailable

Uptime | s/m/h/d/M/Y | Total operational time of node

CPU Cores | cores | Total number of cores

Memory | bytes | Total memory

Disk Size | bytes | Total storage size

Processes | processes | Total number of processes listed by type

Load Average | processes | Load average represents the average number of running and uninterruptible processes residing in the kernel execution queue.

4.4.2.2. Networking panels

Panels that display the network interfaces of the node.

Panel | Unit | Description

Physical Interfaces Ingress Errors | errors | Total errors with incoming data

Physical Interfaces Egress Errors | errors | Total errors with outgoing data

Physical Interfaces Ingress Error Rates | errors/s | Rate of incoming data errors

Physical Interfaces Egress Error Rates | errors/s | Rate of outgoing data errors

Physical Interfaces Packets Ingress | pps | Incoming packets per second

Physical Interfaces Packets Egress | pps | Outgoing packets per second

Physical Interfaces Data Ingress | bytes/s | Incoming data rates

Physical Interfaces Data Egress | bytes/s | Outgoing data rates

Physical Interfaces Drop Rate Ingress | pps | Incoming packets drop rate

Physical Interfaces Drop Rate Egress | pps | Outgoing packets drop rate

4.4.2.3. CPU panels

Panels that display CPU usage of the node.

Panel | Unit | Description

Current CPU Usage | percent | Instantaneous usage at the time of the last query.

Aggregate CPU Usage | percent | Average non-idle CPU activity of all cores on a node.

Aggr. CPU Usage by Type | percent | Shows time spent for each type of thread averaged across all cores.

4.4.2.4. Memory panels

Panels that display memory usage on the node.

Panel | Unit | Description

Memory Used | percent | Amount of memory being used at time of last query.

Huge Pages Used | hugepages | Number of hugepages being used.

Memory

4.4.2.5. Disk/file system

Panels that display space used on disk.

Panel | Unit | Description | Notes

Disk Space Usage | percent | Total disk use at time of last query. | -

Inode Usage | percent | Total inode use at time of last query. | -

Aggregate Disk Space Usage | bytes | Total disk space used and reserved. | Because this query relies on the df plugin, temporary file systems that do not necessarily use disk space are included in the results. The query tries to filter out most of these, but it might not be exhaustive.

Disk Traffic | bytes/s | Shows rates for both reading and writing. | -

Disk Load | percent | Approximate percentage of total disk bandwidth being used. The weighted I/O time series includes the backlog that might be accumulating. For more information, see the collectd disk plugin docs. | -

Operations/s | ops/s | Operations done per second | -

Average I/O Operation Time | seconds | Average time each I/O operation took to complete. This average is not accurate. For more information, see the collectd disk plugin docs. | -

4.5. Configuring multiple clouds

You can configure multiple Red Hat OpenStack Platform clouds to target a single instance of Service Telemetry Framework (STF):

  1. Plan the AMQP address prefixes that you want to use for each cloud. For more information, see Section 4.5.1, “Planning AMQP address prefixes”.
  2. Deploy metrics and events consumer Smart Gateways for each cloud to listen on the corresponding address prefixes. For more information, see Section 4.5.2, “Deploying Smart Gateways”.
  3. Configure each cloud to send its metrics and events to STF on the correct address. For more information, see Section 4.5.3, “Creating the OpenStack environment file”.

Figure 4.1. Two Red Hat OpenStack Platform clouds connect to STF

4.5.1. Planning AMQP address prefixes

By default, Red Hat OpenStack Platform nodes collect data through two data collectors: collectd and Ceilometer. These components send telemetry data or notifications to their respective AMQP addresses, for example, collectd/telemetry. STF Smart Gateways listen on those addresses for monitoring data.

To support multiple clouds and to identify which cloud generated the monitoring data, configure each cloud to send data to a unique address. Prefix a cloud identifier to the second part of the address. The following list shows some example addresses and identifiers:

  • collectd/cloud1-telemetry
  • collectd/cloud1-notify
  • anycast/ceilometer/cloud1-event.sample
  • collectd/cloud2-telemetry
  • collectd/cloud2-notify
  • anycast/ceilometer/cloud2-event.sample
  • collectd/us-east-1-telemetry
  • collectd/us-west-3-telemetry

4.5.2. Deploying Smart Gateways

You must deploy a Smart Gateway for each of the data collection types for each cloud: one for collectd metrics, one for collectd events, and one for Ceilometer events. Configure each of the Smart Gateways to listen on the AMQP address that you define for the corresponding cloud.

When you deploy STF for the first time, Smart Gateway manifests are created that define the initial Smart Gateways for a single cloud. When deploying Smart Gateways for multiple cloud support, you deploy multiple Smart Gateways for each of the data collection types that handle the metrics and the events data for each cloud. The initial Smart Gateways act as a template to create additional Smart Gateways, along with any authentication information required to connect to the data stores.

Procedure
  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    oc project service-telemetry
  3. Use the initially deployed Smart Gateways as a template for additional Smart Gateways. List the currently deployed Smart Gateways with the oc get smartgateways command. For example, if you deployed STF with metricsEnabled: true and eventsEnabled: true, the following Smart Gateways are displayed in the output:

    $ oc get smartgateways
    
    NAME                                         AGE
    stf-default-ceilometer-notification          14d
    stf-default-collectd-notification            14d
    stf-default-collectd-telemetry               14d
  4. Retrieve the manifests for each Smart Gateway and store the contents in a temporary file, which you can modify later and use to create the new set of Smart Gateways:

    truncate --size 0 /tmp/cloud1-smartgateways.yaml && \
    for sg in $(oc get smartgateways -oname)
    do
      echo "---" >> /tmp/cloud1-smartgateways.yaml
      oc get ${sg} -oyaml --export >> /tmp/cloud1-smartgateways.yaml
    done
  5. Modify the Smart Gateway manifest in the /tmp/cloud1-smartgateways.yaml file. Adjust the metadata.name and spec.amqpUrl fields to include the cloud identifier from your schema. For more information, see Section 4.5.1, “Planning AMQP address prefixes”. To view example Smart Gateway manifests, see Section 4.5.2.1, “Example manifests”.
  6. Deploy your new Smart Gateways:

    oc apply -f /tmp/cloud1-smartgateways.yaml
  7. Verify that each Smart Gateway is running. This can take several minutes depending on the number of Smart Gateways:

    oc get po -l app=smart-gateway
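
The manual edit of the metadata.name and spec.amqpUrl fields can be sketched with sed. The following is a minimal sketch under stated assumptions: retarget_smartgateways is a hypothetical helper, the manifests use two-space indentation, the Smart Gateway names start with stf-default-, and the cloud identifier is prefixed to the last part of the AMQP address. Review the result before you apply it.

```shell
#!/bin/sh
# Rewrite exported Smart Gateway manifests for one cloud: append the cloud
# identifier to metadata.name and prefix it to the last part of spec.amqpUrl.
# retarget_smartgateways is a hypothetical helper that filters standard input.
retarget_smartgateways() {
  cloud=$1
  sed -e "s|^\(  name: stf-default-.*\)\$|\1-${cloud}|" \
      -e "s|^\(  amqpUrl: .*/\)\([^/]*\)\$|\1${cloud}-\2|"
}

# Abbreviated sample manifest.
retarget_smartgateways cloud1 <<'EOF'
metadata:
  name: stf-default-collectd-telemetry
spec:
  amqpUrl: stf-default-interconnect.service-telemetry.svc.cluster.local:5672/collectd/telemetry
  serviceType: metrics
EOF
```

To process the exported file, redirect it into the helper and write the result to a new file, for example retarget_smartgateways cloud1 < /tmp/cloud1-smartgateways.yaml.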

4.5.2.1. Example manifests

Important

The content in the following examples might differ from the file content in your deployment. Copy the manifests from your deployment.

Ensure that the name and amqpUrl parameters of each Smart Gateway match the names that you want to use for your clouds. For more information, see Section 4.5.1, “Planning AMQP address prefixes”.

Note

Your output might have some additional metadata parameters that you can remove from the manifests that you load into OCP.

apiVersion: smartgateway.infra.watch/v2alpha1
kind: SmartGateway
metadata:
  name: stf-default-ceilometer-notification-cloud1  1
spec:
  amqpDataSource: ceilometer
  amqpUrl: stf-default-interconnect.service-telemetry.svc.cluster.local:5672/anycast/ceilometer/cloud1-event.sample  2
  debug: false
  elasticPass: fkzfhghw......
  elasticUrl: https://elasticsearch-es-http.service-telemetry.svc.cluster.local:9200
  elasticUser: elastic
  resetIndex: false
  serviceType: events
  size: 1
  tlsCaCert: /config/certs/ca.crt
  tlsClientCert: /config/certs/tls.crt
  tlsClientKey: /config/certs/tls.key
  tlsServerName: elasticsearch-es-http.service-telemetry.svc.cluster.local
  useBasicAuth: true
  useTls: true
---
apiVersion: smartgateway.infra.watch/v2alpha1
kind: SmartGateway
metadata:
  name: stf-default-collectd-notification-cloud1  3
spec:
  amqpDataSource: collectd
  amqpUrl: stf-default-interconnect.service-telemetry.svc.cluster.local:5672/collectd/cloud1-notify  4
  debug: false
  elasticPass: fkzfhghw......
  elasticUrl: https://elasticsearch-es-http.service-telemetry.svc.cluster.local:9200
  elasticUser: elastic
  resetIndex: false
  serviceType: events
  size: 1
  tlsCaCert: /config/certs/ca.crt
  tlsClientCert: /config/certs/tls.crt
  tlsClientKey: /config/certs/tls.key
  tlsServerName: elasticsearch-es-http.service-telemetry.svc.cluster.local
  useBasicAuth: true
  useTls: true
---
apiVersion: smartgateway.infra.watch/v2alpha1
kind: SmartGateway
metadata:
  name: stf-default-collectd-telemetry-cloud1 5
spec:
  amqpUrl: stf-default-interconnect.service-telemetry.svc.cluster.local:5672/collectd/cloud1-telemetry  6
  debug: false
  prefetch: 15000
  serviceType: metrics
  size: 1
  useTimestamp: true
1
Name for Ceilometer notifications for cloud1
2
AMQP Address for Ceilometer notifications for cloud1
3
Name for collectd notifications for cloud1
4
AMQP Address for collectd notifications for cloud1
5
Name for collectd telemetry for cloud1
6
AMQP Address for collectd telemetry for cloud1

4.5.3. Creating the OpenStack environment file

To label traffic according to the cloud of origin, you must create a configuration with cloud-specific instance names. Create an stf-connectors.yaml file and adjust the values of CeilometerQdrEventsConfig and CollectdAmqpInstances to match the AMQP address prefix scheme. For more information, see Section 4.5.1, “Planning AMQP address prefixes”.

Warning

Remove enable-stf.yaml and ceilometer-write-qdr.yaml environment file references from your overcloud deployment. This configuration is redundant and results in duplicate information being sent from each cloud node.

Procedure
  1. Create the stf-connectors.yaml file and modify it to match the AMQP address that you want for this cloud deployment:
resource_registry:
    OS::TripleO::Services::Collectd: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml
    OS::TripleO::Services::MetricsQdr: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/qdr-container-puppet.yaml
    OS::TripleO::Services::CeilometerAgentCentral: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-central-container-puppet.yaml
    OS::TripleO::Services::CeilometerAgentNotification: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-notification-container-puppet.yaml
    OS::TripleO::Services::CeilometerAgentIpmi: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-ipmi-container-puppet.yaml
    OS::TripleO::Services::ComputeCeilometerAgent: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-compute-container-puppet.yaml
    OS::TripleO::Services::Redis: /usr/share/openstack-tripleo-heat-templates/deployment/database/redis-pacemaker-puppet.yaml

parameter_defaults:
    EnableSTF: true

    EventPipelinePublishers: []
    CeilometerEnablePanko: false
    CeilometerQdrPublishEvents: true
    CeilometerQdrEventsConfig:
        driver: amqp
        topic: cloud1-event   1

    CollectdConnectionType: amqp1
    CollectdAmqpInterval: 5
    CollectdDefaultPollingInterval: 5

    CollectdAmqpInstances:
        cloud1-notify:        2
            notify: true
            format: JSON
            presettle: false
        cloud1-telemetry:     3
            format: JSON
            presettle: true

    MetricsQdrAddresses:
        - prefix: collectd
          distribution: multicast
        - prefix: anycast/ceilometer
          distribution: multicast

    MetricsQdrSSLProfiles:
        - name: sslProfile

    MetricsQdrConnectors:
        - host: stf-default-interconnect-5671-service-telemetry.apps.infra.watch   4
          port: 443
          role: edge
          verifyHostname: false
          sslProfile: sslProfile

1
Define the topic for Ceilometer events. This value is the address format of anycast/ceilometer/cloud1-event.sample.
2
Define the topic for collectd events. This value is the format of collectd/cloud1-notify.
3
Define the topic for collectd metrics. This value is the format of collectd/cloud1-telemetry.
4
Adjust the MetricsQdrConnectors host to the address of the STF route.

  2. Ensure that the naming convention in the stf-connectors.yaml file aligns with the spec.amqpUrl field in the Smart Gateway configuration. For example, configure the CeilometerQdrEventsConfig.topic field to a value of cloud1-event.
  3. Save the file in a directory for custom environment files, for example /home/stack/custom_templates/.
  4. Source the authentication file:

    [stack@undercloud-0 ~]$ source stackrc
    
    (undercloud) [stack@undercloud-0 ~]$
  5. Include the stf-connectors.yaml file in the overcloud deployment command, along with any other environment files relevant to your environment:

    (undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy \
    --templates /usr/share/openstack-tripleo-heat-templates \
    ...
    -e /home/stack/custom_templates/stf-connectors.yaml \
    ...
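
Before you run the deployment, you can check that the environment file names the topic and collectd instances for the intended cloud. The following is a minimal sketch: connectors_match_cloud is a hypothetical helper that uses plain text matching instead of a YAML parser, and the sample file is an abbreviated fragment of stf-connectors.yaml.

```shell
#!/bin/sh
# Succeed when an environment file names the Ceilometer topic and both
# collectd instances for the given cloud identifier.
# connectors_match_cloud is a hypothetical helper.
connectors_match_cloud() {
  env_file=$1
  cloud=$2
  grep -q "topic: ${cloud}-event" "$env_file" &&
    grep -q "^ *${cloud}-notify:" "$env_file" &&
    grep -q "^ *${cloud}-telemetry:" "$env_file"
}

# Abbreviated sample fragment of stf-connectors.yaml.
env_file=$(mktemp)
cat > "$env_file" <<'EOF'
parameter_defaults:
    CeilometerQdrEventsConfig:
        driver: amqp
        topic: cloud1-event
    CollectdAmqpInstances:
        cloud1-notify:
            notify: true
        cloud1-telemetry:
            format: JSON
EOF

if connectors_match_cloud "$env_file" cloud1; then
  echo "environment file matches cloud1"
else
  echo "environment file does not match cloud1"
fi
```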
Additional resources

For information about validating the deployment, see Section 3.3.3, “Validating client-side installation”.

4.5.4. Querying metrics data from multiple clouds

Data stored in Prometheus has a service label attached according to the Smart Gateway it was scraped from. You can use this label to query data from a specific cloud.

To query data from a specific cloud, use a PromQL query that matches the associated service label, for example: collectd_uptime{service="stf-default-collectd-telemetry-cloud1-smartgateway"}.
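
The selector can be built from a metric name and a cloud identifier. The following is a minimal sketch: cloud_metric_query is a hypothetical helper, and the service label pattern is inferred from the example in this section, so confirm the label value against your own data.

```shell
#!/bin/sh
# Build a PromQL selector for one cloud. The service label value follows
# the pattern in the example above; treat the pattern as an assumption.
# cloud_metric_query is a hypothetical helper.
cloud_metric_query() {
  metric=$1
  cloud=$2
  printf '%s{service="stf-default-collectd-telemetry-%s-smartgateway"}' "$metric" "$cloud"
}

cloud_metric_query collectd_uptime cloud1
echo

# From a pod with access to curl, the query could be sent with:
#   curl -G prometheus-operated:9090/api/v1/query \
#     --data-urlencode "query=$(cloud_metric_query collectd_uptime cloud1)"
```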

4.6. Ephemeral storage

Use ephemeral storage to run Service Telemetry Framework (STF) without persistently storing data in your Red Hat OpenShift Container Platform (OCP) cluster. Ephemeral storage is not recommended in a production environment because the data is volatile even when the platform operates correctly and as designed. For example, restarting a pod or rescheduling the workload to another node results in the loss of any local data written since the pod started.

If you enable ephemeral storage in STF, the Service Telemetry Operator does not add the relevant storage sections to the manifests of the data storage components.

4.6.1. Configuring ephemeral storage

To configure STF for ephemeral storage, add storageEphemeralEnabled: true to the ServiceTelemetry object in OCP. You can add storageEphemeralEnabled: true at installation time or, if you already deployed STF, complete the following steps:

Procedure

  1. Log in to Red Hat OpenShift Container Platform.
  2. Change to the service-telemetry namespace:

    oc project service-telemetry
  3. Edit the ServiceTelemetry object:

    $ oc edit ServiceTelemetry stf-default
  4. Add the storageEphemeralEnabled: true parameter to the spec section:

    spec:
      eventsEnabled: true
      metricsEnabled: true
      storageEphemeralEnabled: true
  5. Save your changes and close the object.