Chapter 4. Advanced features
The following optional features can provide additional functionality to the Service Telemetry Framework (STF):
- Section 4.1, “Customizing the deployment”
- Section 4.2, “Alerts”
- Section 4.3, “Configuring SNMP Traps”
- Section 4.4, “High availability”
- Section 4.5, “Dashboards”
- Section 4.6, “Multiple cloud configuration”
- Section 4.7, “Ephemeral storage”
- Section 4.8, “Monitoring the resource usage of Red Hat OpenStack Platform services”
4.1. Customizing the deployment
The Service Telemetry Operator watches for a ServiceTelemetry
manifest to load into Red Hat OpenShift Container Platform (OCP). The Operator then creates other objects in memory, which results in the dependent Operators creating the workloads they are responsible for managing.
When you override the manifest, you must provide the entire manifest contents, including object names or namespaces. There is no dynamic parameter substitution when you override a manifest.
Use manifest overrides only as a last resort to short-circuit the normal management of these objects.
To override a manifest successfully with Service Telemetry Framework (STF), deploy a default environment using the core options only. For more information about the core options, see Section 2.3.11, “Creating a ServiceTelemetry object in OCP”. When you deploy STF, use the oc get
command to retrieve the default deployed manifest. When you use a manifest that was originally generated by Service Telemetry Operator, the manifest is compatible with the other objects that are managed by the Operators.
For example, when the backends.metrics.prometheus.enabled: true
parameter is configured in the ServiceTelemetry
manifest, the Service Telemetry Operator requests components for metrics retrieval and storage using the default manifests. In some cases, you might want to override the default manifest. For more information, see Section 4.1.1, “Manifest override parameters”.
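For reference, a minimal ServiceTelemetry manifest that enables the metrics backend might look like the following sketch. The object name, namespace, and parameter path match the defaults used elsewhere in this guide:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    metrics:
      prometheus:
        enabled: true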
4.1.1. Manifest override parameters
This table describes the available parameters that you can use to override a manifest, along with the corresponding retrieval commands.
Table 4.1. Manifest override parameters
Override parameter | Description | Retrieval command |
---|---|---|
[Table rows not reproduced here: each row lists one manifest override parameter, a description of the default manifest that it overrides, and the oc command that retrieves that default manifest.]
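As an example of the retrieval pattern that this table describes, the Alertmanager configuration Secret that the alertmanagerConfigManifest parameter overrides (used in Section 4.1.2 and Section 4.2.3) can be read back with a command such as the following. Treat this as an illustration of the pattern rather than a complete list of retrieval commands:

$ oc get secret alertmanager-default -o yaml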
4.1.2. Overriding a managed manifest
Edit the ServiceTelemetry
object and provide a parameter and manifest. For a list of available manifest override parameters, see Section 4.1.1, “Manifest override parameters”. The default ServiceTelemetry
object is default
. Use oc get servicetelemetry
to list the available STF deployments.
The oc edit
command loads the default system editor. To override the default editor, pass or set the environment variable EDITOR
to the preferred editor. For example, EDITOR=nano oc edit servicetelemetry default
.
Procedure
- Log in to Red Hat OpenShift Container Platform.
Change to the
service-telemetry
namespace:

$ oc project service-telemetry
Load the
ServiceTelemetry
object into an editor:

$ oc edit servicetelemetry default
To modify the
ServiceTelemetry
object, provide a manifest override parameter and the contents of the manifest to write to OCP instead of the defaults provided by STF.

Note: The trailing pipe (|) after entering the manifest override parameter indicates that the value provided is multi-line.

$ oc edit stf default

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  ...
spec:
  alertmanagerConfigManifest: | 1
    apiVersion: v1
    kind: Secret
    metadata:
      name: 'alertmanager-default'
      namespace: 'service-telemetry'
    type: Opaque
    stringData:
      alertmanager.yaml: |-
        global:
          resolve_timeout: 10m
        route:
          group_by: ['job']
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 12h
          receiver: 'null'
        receivers:
        - name: 'null' 2
status:
  ...
- Save and close.
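To confirm that your override took effect, read the managed object back and check for your change. For the alertmanagerConfigManifest example above, the same verification command that is used later in Section 4.2.3 applies:

$ oc get secret alertmanager-default -o go-template='{{index .data "alertmanager.yaml" | base64decode }}'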
4.2. Alerts
You create alert rules in Prometheus and alert routes in Alertmanager. Alert rules in Prometheus servers send alerts to an Alertmanager, which manages the alerts. Alertmanager can silence, inhibit, or aggregate alerts, and send notifications using email, on-call notification systems, or chat platforms.
To create an alert, complete the following tasks:
- Create an alert rule in Prometheus. For more information, see Section 4.2.1, “Creating an alert rule in Prometheus”.
- Create an alert route in Alertmanager. For more information, see Section 4.2.3, “Creating an alert route in Alertmanager”.
Additional resources
For more information about alerts or notifications with Prometheus and Alertmanager, see https://prometheus.io/docs/alerting/overview/
To view an example set of alerts that you can use with Service Telemetry Framework (STF), see https://github.com/infrawatch/service-telemetry-operator/tree/master/deploy/alerts
4.2.1. Creating an alert rule in Prometheus
Prometheus evaluates alert rules to trigger notifications. If the rule condition returns an empty result set, the condition is false. Otherwise, the rule is true and it triggers an alert.
Procedure
- Log in to Red Hat OpenShift Container Platform.
Change to the
service-telemetry
namespace:

$ oc project service-telemetry
Create a
PrometheusRule
object that contains the alert rule. The Prometheus Operator loads the rule into Prometheus:

$ oc apply -f - <<EOF
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  creationTimestamp: null
  labels:
    prometheus: default
    role: alert-rules
  name: prometheus-alarm-rules
  namespace: service-telemetry
spec:
  groups:
  - name: ./openstack.rules
    rules:
    - alert: Metric Listener down
      expr: collectd_qpid_router_status < 1
      # To change the rule, edit the value of the expr parameter.
EOF
To verify that the rules have been loaded into Prometheus by the Operator, create a pod with access to
curl
:

$ oc run curl --generator=run-pod/v1 --image=radial/busyboxplus:curl -i --tty
Run
curl
to access the prometheus-operated
service to return the rules loaded into memory:

[ root@curl:/ ]$ curl prometheus-operated:9090/api/v1/rules
{"status":"success","data":{"groups":[{"name":"./openstack.rules","file":"/etc/prometheus/rules/prometheus-default-rulefiles-0/service-telemetry-prometheus-alarm-rules.yaml","rules":[{"name":"Metric Listener down","query":"collectd_qpid_router_status \u003c 1","duration":0,"labels":{},"annotations":{},"alerts":[],"health":"ok","type":"alerting"}],"interval":30}]}}
After you verify that the output shows the rules defined in the
PrometheusRule
object, for example the defined ./openstack.rules group, exit from the pod:

[ root@curl:/ ]$ exit
Clean up the environment by deleting the
curl
pod:

$ oc delete pod curl
pod "curl" deleted
Additional resources
For more information on alerting, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md
4.2.2. Configuring custom alerts
You can add custom alerts to the PrometheusRule
object that you created in Section 4.2.1, “Creating an alert rule in Prometheus”.
Procedure
Use the
oc edit
command:

$ oc edit prometheusrules prometheus-alarm-rules
- Edit the PrometheusRules manifest.
- Save and close.
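For example, you might append a rule such as the following to the rules list of the PrometheusRule object. This is an illustrative sketch only: the alert name, the absent() expression, the for duration, and the severity label are placeholders to adapt to your environment:

- alert: Collectd metrics missing
  expr: absent(collectd_uptime)     # fires if no collectd_uptime series exists
  for: 5m
  labels:
    severity: warning
  annotations:
    summary: No collectd_uptime metrics have been received for 5 minutes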
Additional resources
- For more information about configuring alerting rules, see https://prometheus.io/docs/prometheus/latest/configuration/alerting_rules/.
- For more information about PrometheusRules objects, see https://github.com/coreos/prometheus-operator/blob/master/Documentation/user-guides/alerting.md
4.2.3. Creating an alert route in Alertmanager
Use Alertmanager to deliver alerts to an external system, such as email, IRC, or other notification channel. The Prometheus Operator manages the Alertmanager configuration as a Red Hat OpenShift Container Platform (OCP) secret. By default, STF deploys a basic configuration that results in no receivers:
alertmanager.yaml: |-
  global:
    resolve_timeout: 5m
  route:
    group_by: ['job']
    group_wait: 30s
    group_interval: 5m
    repeat_interval: 12h
    receiver: 'null'
  receivers:
  - name: 'null'
To deploy a custom Alertmanager route with STF, pass an alertmanagerConfigManifest
parameter to the Service Telemetry Operator. This results in an updated secret that is managed by the Prometheus Operator. For more information, see Section 4.1.2, “Overriding a managed manifest”.
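For reference, a receiver that sends notifications instead of discarding them can replace the 'null' receiver in the alertmanager.yaml content. The following is a sketch that uses standard Alertmanager email configuration syntax; the receiver name, addresses, and SMTP host are placeholders, not values provided by STF:

route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'email-ops'
receivers:
- name: 'email-ops'
  email_configs:
  - to: 'ops-team@example.com'          # placeholder recipient
    from: 'alertmanager@example.com'    # placeholder sender
    smarthost: 'smtp.example.com:587'   # placeholder SMTP relay
    require_tls: true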
Procedure
- Log in to Red Hat OpenShift Container Platform.
Change to the
service-telemetry
namespace:

$ oc project service-telemetry
Edit the
ServiceTelemetry
object for your STF deployment:

$ oc edit stf default
Add a new parameter,
alertmanagerConfigManifest
, and the Secret
object contents to define the alertmanager.yaml
configuration for Alertmanager:

Note: This step loads the default template that is already managed by Service Telemetry Operator. To verify that the changes are populating correctly, change a value, return the alertmanager-default secret, and verify that the new value is loaded into memory. For example, change the value global.resolve_timeout from 5m to 10m.

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  backends:
    metrics:
      prometheus:
        enabled: true
  alertmanagerConfigManifest: |
    apiVersion: v1
    kind: Secret
    metadata:
      name: 'alertmanager-default'
      namespace: 'service-telemetry'
    type: Opaque
    stringData:
      alertmanager.yaml: |-
        global:
          resolve_timeout: 10m
        route:
          group_by: ['job']
          group_wait: 30s
          group_interval: 5m
          repeat_interval: 12h
          receiver: 'null'
        receivers:
        - name: 'null'
Verify that the configuration was applied to the secret:
$ oc get secret alertmanager-default -o go-template='{{index .data "alertmanager.yaml" | base64decode }}'

global:
  resolve_timeout: 10m
route:
  group_by: ['job']
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 12h
  receiver: 'null'
receivers:
- name: 'null'
To verify the configuration has been loaded into Alertmanager, create a pod with access to
curl
:

$ oc run curl --generator=run-pod/v1 --image=radial/busyboxplus:curl -i --tty
Run
curl
against the alertmanager-operated
service to retrieve the status and configYAML
contents, and verify that the supplied configuration matches the configuration loaded into Alertmanager:

[ root@curl:/ ]$ curl alertmanager-operated:9093/api/v1/status
{"status":"success","data":{"configYAML":"global:\n resolve_timeout: 10m\n http_config: {}\n smtp_hello: localhost\n smtp_require_tls: true\n pagerduty_url: https://events.pagerduty.com/v2/enqueue\n hipchat_api_url: https://api.hipchat.com/\n opsgenie_api_url: https://api.opsgenie.com/\n wechat_api_url: https://qyapi.weixin.qq.com/cgi-bin/\n victorops_api_url: https://alert.victorops.com/integrations/generic/20131114/alert/\nroute:\n receiver: \"null\"\n group_by:\n - job\n group_wait: 30s\n group_interval: 5m\n repeat_interval: 12h\nreceivers:\n- name: \"null\"\ntemplates: []\n",...}}
Verify that the
configYAML
field contains the expected changes. Exit from the pod:

[ root@curl:/ ]$ exit
To clean up the environment, delete the
curl
pod:

$ oc delete pod curl
pod "curl" deleted
Additional resources
- For more information about the Red Hat OpenShift Container Platform secret and the Prometheus operator, see Alerting.
4.3. Configuring SNMP Traps
You can integrate Service Telemetry Framework (STF) with an existing infrastructure monitoring platform that receives notifications via SNMP traps. To enable SNMP traps, modify the ServiceTelemetry
object and configure the snmpTraps
parameters.
For more information about configuring alerts, see Section 4.2, “Alerts”.
Prerequisites
- Know the IP address or hostname of the SNMP trap receiver where you want to send the alerts
Procedure
To enable SNMP traps, modify the
ServiceTelemetry
object:

$ oc edit stf default
Set the
alerting.alertmanager.receivers.snmpTraps
parameters:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
...
spec:
  ...
  alerting:
    alertmanager:
      receivers:
        snmpTraps:
          enabled: true
          target: 10.10.10.10
-
Ensure that you set the value of
target
to the IP address or hostname of the SNMP trap receiver.
4.4. High availability
High availability is the ability of Service Telemetry Framework (STF) to rapidly recover from failures in its component services. Although Red Hat OpenShift Container Platform (OCP) restarts a failed pod if nodes are available to schedule the workload, this recovery process might take more than one minute, during which time events and metrics are lost. A high availability configuration includes multiple copies of STF components, reducing recovery time to approximately 2 seconds. To protect against failure of an OCP node, deploy STF to an OCP cluster with three or more nodes.
STF is not yet a fully fault tolerant system. Delivery of metrics and events during the recovery period is not guaranteed.
Enabling high availability has the following effects:
- Three ElasticSearch pods run instead of the default one.
The following components run two pods instead of the default one:
- AMQ Interconnect
- Alertmanager
- Prometheus
- Events Smart Gateway
- Collectd Metrics Smart Gateway
- Recovery time from a lost pod in any of these services reduces to approximately 2 seconds.
The Ceilometer Metrics Smart Gateway is not yet highly available (HA).
4.4.1. Configuring high availability
To configure STF for high availability, add highAvailability.enabled: true
to the ServiceTelemetry object in OCP. You can set this parameter at installation time or, if you have already deployed STF, complete the following steps:
Procedure
- Log in to Red Hat OpenShift Container Platform.
Change to the
service-telemetry
namespace:

$ oc project service-telemetry
Use the oc command to edit the ServiceTelemetry object:
$ oc edit stf default
Add
highAvailability.enabled: true
to the spec
section:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
...
spec:
  ...
  highAvailability:
    enabled: true
- Save your changes and close the object.
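To confirm that additional replicas were created, you can list the pods in the namespace and check the counts of the components listed above. The exact pod names vary between deployments:

$ oc get pods -n service-telemetry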
4.5. Dashboards
Use the third-party application Grafana to visualize system-level metrics gathered by collectd for each individual host node. For more information about configuring collectd, see Section 3.3, “Configuring Red Hat OpenStack Platform overcloud for Service Telemetry Framework”.
4.5.1. Setting up Grafana to host the dashboard
Grafana is not included in the default Service Telemetry Framework (STF) deployment, so you must deploy the Grafana Operator from OperatorHub.io. Using the Service Telemetry Operator to deploy Grafana results in a Grafana instance and the configuration of the default data sources for the local STF deployment.
Prerequisites
Enable OperatorHub.io catalog source for the Grafana Operator. For more information, see Section 2.3.5, “Enabling the OperatorHub.io Community Catalog Source”.
Procedure
- Log in to Red Hat OpenShift Container Platform.
Change to the
service-telemetry
namespace:

$ oc project service-telemetry
Deploy the Grafana operator:
$ oc apply -f - <<EOF
apiVersion: operators.coreos.com/v1alpha1
kind: Subscription
metadata:
  name: grafana-operator
  namespace: service-telemetry
spec:
  channel: alpha
  installPlanApproval: Automatic
  name: grafana-operator
  source: operatorhubio-operators
  sourceNamespace: openshift-marketplace
EOF
To verify that the operator launched successfully, run the
oc get csv
command. If the value of the PHASE column is Succeeded
, the operator launched successfully:

$ oc get csv
NAME                      DISPLAY            VERSION   REPLACES   PHASE
grafana-operator.v3.2.0   Grafana Operator   3.2.0                Succeeded
...
To launch a Grafana instance, create or modify the
ServiceTelemetry
object. Set graphing.enabled
to true:

$ oc edit stf default
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
...
spec:
  ...
  graphing:
    enabled: true
Verify that the Grafana instance deployed:
$ oc get pod -l app=grafana
NAME                                  READY   STATUS    RESTARTS   AGE
grafana-deployment-7fc7848b56-sbkhv   1/1     Running   0          1m
4.5.2. Importing dashboards
The Grafana Operator can import and manage dashboards by creating GrafanaDashboard
objects. You can view example dashboards at https://github.com/infrawatch/dashboards.
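For reference, a minimal GrafanaDashboard object has roughly the following shape. This is a sketch only: the object name, label, and dashboard JSON are illustrative placeholders, and the rhos-dashboard.yaml imported in the procedure below is a complete working example:

apiVersion: integreatly.org/v1alpha1
kind: GrafanaDashboard
metadata:
  name: example-dashboard          # illustrative name
  namespace: service-telemetry
  labels:
    app: grafana                   # assumed to match the Operator's dashboard selector
spec:
  name: example-dashboard.json
  json: |
    {
      "title": "Example dashboard",
      "panels": []
    }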
Procedure
Import a dashboard:
$ oc apply -f https://raw.githubusercontent.com/infrawatch/dashboards/master/deploy/rhos-dashboard.yaml
grafanadashboard.integreatly.org/rhos-dashboard created
Verify that the resources installed correctly:
$ oc get grafanadashboards
NAME             AGE
rhos-dashboard   7d21h
$ oc get grafanadatasources
NAME                    AGE
default-ds-prometheus   20h
Expose the grafana service as a route:
$ oc create route edge dashboards --service=grafana-service --insecure-policy="Redirect" --port=3000
Retrieve the Grafana route address:
$ oc get route dashboards
NAME         HOST/PORT                                                                    PATH   SERVICES          PORT   TERMINATION     WILDCARD
dashboards   dashboards-service-telemetry.apps.stfcloudops1.lab.upshift.rdu2.redhat.com          grafana-service   3000   edge/Redirect   None
The
HOST/PORT
value is the Grafana route address.
-
Navigate to https://<GRAFANA-ROUTE-ADDRESS> in a web browser. Replace <GRAFANA-ROUTE-ADDRESS> with the
HOST/PORT
value that you retrieved in the previous step.
- To view the dashboard, click Dashboards and Manage.
4.5.3. Viewing and editing queries
Procedure
-
Log in to Red Hat OpenShift Container Platform. To view and edit queries, log in as the
admin
user. Change to the
service-telemetry
namespace:

$ oc project service-telemetry
To retrieve the default username and password, describe the Grafana object using the
oc describe
command:

$ oc describe grafana default
Tip: To set the admin username and password through the
ServiceTelemetry
object, use the graphing.grafana.adminUser
and graphing.grafana.adminPassword
parameters.
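For example, these parameters sit under the graphing section of the ServiceTelemetry spec. The credential values shown here are placeholders:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: default
  namespace: service-telemetry
spec:
  graphing:
    enabled: true
    grafana:
      adminUser: admin            # placeholder value
      adminPassword: secretpass   # placeholder value; choose a strong password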
4.5.4. The Grafana infrastructure dashboard
The infrastructure dashboard shows metrics for a single node at a time. Select a node from the upper left corner of the dashboard.
4.5.4.1. Top panels
Title | Unit | Description |
---|---|---|
Current Global Alerts | - | Current alerts fired by Prometheus |
Recent Global Alerts | - | Recently fired alerts in 5m time steps |
Status Panel | - | Node status: up, down, unavailable |
Uptime | s/m/h/d/M/Y | Total operational time of node |
CPU Cores | cores | Total number of cores |
Memory | bytes | Total memory |
Disk Size | bytes | Total storage size |
Processes | processes | Total number of processes listed by type |
Load Average | processes | Load average represents the average number of running and uninterruptible processes residing in the kernel execution queue. |
4.5.4.2. Networking panels
Panels that display the network interfaces of the node.
Panel | Unit | Description |
---|---|---|
Physical Interfaces Ingress Errors | errors | Total errors with incoming data |
Physical Interfaces Egress Errors | errors | Total errors with outgoing data |
Physical Interfaces Ingress Error Rates | errors/s | Rate of incoming data errors |
Physical Interfaces Egress Error Rates | errors/s | Rate of outgoing data errors |
Physical Interfaces Packets Ingress | pps | Incoming packets per second |
Physical Interfaces Packets Egress | pps | Outgoing packets per second |
Physical Interfaces Data Ingress | bytes/s | Incoming data rates |
Physical Interfaces Data Egress | bytes/s | Outgoing data rates |
Physical Interfaces Drop Rate Ingress | pps | Incoming packets drop rate |
Physical Interfaces Drop Rate Egress | pps | Outgoing packets drop rate |
4.5.4.3. CPU panels
Panels that display CPU usage of the node.
Panel | Unit | Description |
---|---|---|
Current CPU Usage | percent | Instantaneous usage at the time of the last query. |
Aggregate CPU Usage | percent | Average non-idle CPU activity of all cores on a node. |
Aggr. CPU Usage by Type | percent | Shows time spent for each type of thread averaged across all cores. |
4.5.4.4. Memory panels
Panels that display memory usage on the node.
Panel | Unit | Description |
---|---|---|
Memory Used | percent | Amount of memory being used at time of last query. |
Huge Pages Used | hugepages | Number of hugepages being used. |
4.5.4.5. Disk/file system
Panels that display space used on disk.
Panel | Unit | Description | Notes |
---|---|---|---|
Disk Space Usage | percent | Total disk use at time of last query. | |
Inode Usage | percent | Total inode use at time of last query. | |
Aggregate Disk Space Usage | bytes | Total disk space used and reserved. | Because this query relies on the |
Disk Traffic | bytes/s | Shows rates for both reading and writing. | |
Disk Load | percent | Approximate percentage of total disk bandwidth being used. The weighted I/O time series includes the backlog that might be accumulating. For more information, see the collectd disk plugin docs. | |
Operations/s | ops/s | Operations done per second | |
Average I/O Operation Time | seconds | Average time each I/O operation took to complete. This average is not accurate, see the collectd disk plugin docs. |
4.6. Multiple cloud configuration
You can configure multiple Red Hat OpenStack Platform clouds to target a single instance of Service Telemetry Framework (STF):
- Plan the AMQP address prefixes that you want to use for each cloud. For more information, see Section 4.6.1, “Planning AMQP address prefixes”.
- Deploy metrics and events consumer Smart Gateways for each cloud to listen on the corresponding address prefixes. For more information, see Section 4.6.2, “Deploying Smart Gateways”.
- Configure each cloud to send its metrics and events to STF on the correct address. For more information, see Section 4.6.4, “Creating the OpenStack environment file”.
Figure 4.1. Two Red Hat OpenStack Platform clouds connect to STF

4.6.1. Planning AMQP address prefixes
By default, Red Hat OpenStack Platform nodes get data through two data collectors; collectd and Ceilometer. These components send telemetry data or notifications to the respective AMQP addresses, for example, collectd/telemetry
, where STF Smart Gateways listen on those addresses for monitoring data.
To support multiple clouds and to identify which cloud generated the monitoring data, configure each cloud to send data to a unique address. Prefix a cloud identifier to the second part of the address. The following list shows some example addresses and identifiers:
-
collectd/cloud1-telemetry
-
collectd/cloud1-notify
-
anycast/ceilometer/cloud1-metering.sample
-
anycast/ceilometer/cloud1-event.sample
-
collectd/cloud2-telemetry
-
collectd/cloud2-notify
-
anycast/ceilometer/cloud2-metering.sample
-
anycast/ceilometer/cloud2-event.sample
-
collectd/us-east-1-telemetry
-
collectd/us-west-3-telemetry
4.6.2. Deploying Smart Gateways
You must deploy a Smart Gateway for each of the data collection types for each cloud; one for collectd metrics, one for collectd events, one for Ceilometer metrics, and one for Ceilometer events. Configure each of the Smart Gateways to listen on the AMQP address that you define for the corresponding cloud. Smart Gateways are defined via the clouds
parameter in the ServiceTelemetry
manifest.
When you deploy STF for the first time, Smart Gateway manifests are created that define the initial Smart Gateways for a single cloud. When deploying Smart Gateways for multiple cloud support, you deploy multiple Smart Gateways for each of the data collection types that handle the metrics and the events data for each cloud. The initial Smart Gateways are defined under cloud1
with the following subscription addresses:
collector | type | default subscription address |
---|---|---|
collectd | metrics | collectd/telemetry |
collectd | events | collectd/notify |
Ceilometer | metrics | anycast/ceilometer/metering.sample |
Ceilometer | events | anycast/ceilometer/event.sample |
Prerequisites
You have determined your naming scheme and have created your list of clouds objects. For more information about determining your naming scheme, see Section 4.6.1, “Planning AMQP address prefixes”. For more information about creating the content for the clouds
parameter, see Section 2.3.10.2, “clouds”.
Procedure
- Log in to Red Hat OpenShift Container Platform.
Change to the
service-telemetry
namespace:

$ oc project service-telemetry
Edit the
default
ServiceTelemetry object and add a clouds
parameter with your configuration:

$ oc edit stf default
apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  ...
spec:
  ...
  clouds:
  - name: cloud1
    events:
      collectors:
      - collectorType: collectd
        subscriptionAddress: collectd/cloud1-notify
      - collectorType: ceilometer
        subscriptionAddress: anycast/ceilometer/cloud1-event.sample
    metrics:
      collectors:
      - collectorType: collectd
        subscriptionAddress: collectd/cloud1-telemetry
      - collectorType: ceilometer
        subscriptionAddress: anycast/ceilometer/cloud1-metering.sample
  - name: cloud2
    events:
      ...
- Save the ServiceTelemetry object.
Verify that each Smart Gateway is running. This can take several minutes depending on the number of Smart Gateways:
$ oc get po -l app=smart-gateway
NAME                                                      READY   STATUS    RESTARTS   AGE
default-cloud1-ceil-event-smartgateway-6cfb65478c-g5q82   1/1     Running   0          13h
default-cloud1-ceil-meter-smartgateway-58f885c76d-xmxwn   1/1     Running   0          13h
default-cloud1-coll-event-smartgateway-58fbbd4485-rl9bd   1/1     Running   0          13h
default-cloud1-coll-meter-smartgateway-7c6fc495c4-jn728   2/2     Running   0          13h
4.6.3. Deleting the default Smart Gateways
After you configure STF for multiple clouds, you can delete the default Smart Gateways if they are no longer in use. The Service Telemetry Operator can remove SmartGateway
objects that have been created but are no longer listed in the ServiceTelemetry clouds
list of objects. You can enable the removal of SmartGateway objects that are not defined by the clouds
parameter by setting cloudsRemoveOnMissing: true
in the ServiceTelemetry
manifest.
If you do not want any Smart Gateways deployed, define an empty clouds object using the clouds: {}
parameter.
The cloudsRemoveOnMissing
parameter is disabled by default. If you enable the cloudsRemoveOnMissing
parameter, you remove any manually created SmartGateway
objects in the current namespace, with no way to restore them.
Procedure
-
Define your
clouds
parameter with the list of cloud objects to be managed by the Service Telemetry Operator. For more information, see Section 2.3.10.2, “clouds”. Edit the ServiceTelemetry object and add the
cloudsRemoveOnMissing
parameter:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  ...
spec:
  ...
  cloudsRemoveOnMissing: true
  clouds:
    ...
- Save the modifications.
Verify that the Operator deleted the Smart Gateways. This can take several minutes while the Operators reconcile the changes:
$ oc get smartgateways
4.6.4. Creating the OpenStack environment file
To label traffic according to the cloud of origin, you must create a configuration with cloud-specific instance names. Create an stf-connectors.yaml
file and adjust the values of CeilometerQdrEventsConfig
, CeilometerQdrMetricsConfig
and CollectdAmqpInstances
to match the AMQP address prefix scheme. For more information, see Section 4.6.1, “Planning AMQP address prefixes”.
Remove enable-stf.yaml
and ceilometer-write-qdr.yaml
environment file references from your overcloud deployment. This configuration is redundant and results in duplicate information being sent from each cloud node.
Procedure
Create the
stf-connectors.yaml
file and modify it to match the AMQP address that you want for this cloud deployment:

stf-connectors.yaml
resource_registry:
  OS::TripleO::Services::Collectd: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/collectd-container-puppet.yaml
  OS::TripleO::Services::MetricsQdr: /usr/share/openstack-tripleo-heat-templates/deployment/metrics/qdr-container-puppet.yaml
  OS::TripleO::Services::CeilometerAgentCentral: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-central-container-puppet.yaml
  OS::TripleO::Services::CeilometerAgentNotification: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-notification-container-puppet.yaml
  OS::TripleO::Services::CeilometerAgentIpmi: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-ipmi-container-puppet.yaml
  OS::TripleO::Services::ComputeCeilometerAgent: /usr/share/openstack-tripleo-heat-templates/deployment/ceilometer/ceilometer-agent-compute-container-puppet.yaml
  OS::TripleO::Services::Redis: /usr/share/openstack-tripleo-heat-templates/deployment/database/redis-pacemaker-puppet.yaml

parameter_defaults:
  EnableSTF: true
  EventPipelinePublishers: []
  MetricPipelinePublishers: []
  CeilometerEnablePanko: false
  CeilometerQdrPublishEvents: true
  CeilometerQdrEventsConfig:
    driver: amqp
    topic: cloud1-event 1
  CeilometerQdrMetricsConfig:
    driver: amqp
    topic: cloud1-metering 2
  CollectdConnectionType: amqp1
  CollectdAmqpInterval: 5
  CollectdDefaultPollingInterval: 5
  CollectdAmqpInstances:
    cloud1-notify: 3
      notify: true
      format: JSON
      presettle: false
    cloud1-telemetry: 4
      format: JSON
      presettle: true
  MetricsQdrAddresses:
  - prefix: collectd
    distribution: multicast
  - prefix: anycast/ceilometer
    distribution: multicast
  MetricsQdrSSLProfiles:
  - name: sslProfile
  MetricsQdrConnectors:
  - host: stf-default-interconnect-5671-service-telemetry.apps.infra.watch 5
    port: 443
    role: edge
    verifyHostname: false
    sslProfile: sslProfile
1. Define the topic for Ceilometer events. This value is the address format of anycast/ceilometer/cloud1-event.sample.
2. Define the topic for Ceilometer metrics. This value is the address format of anycast/ceilometer/cloud1-metering.sample.
3. Define the topic for collectd events. This value is the format of collectd/cloud1-notify.
4. Define the topic for collectd metrics. This value is the format of collectd/cloud1-telemetry.
5. Adjust the MetricsQdrConnectors host to the address of the STF route.
-
Ensure that the naming convention in the
stf-connectors.yaml
file aligns with the spec.amqpUrl
field in the Smart Gateway configuration. For example, configure the CeilometerQdrEventsConfig.topic
field to a value of cloud1-event
. -
Save the file in a directory for custom environment files, for example
/home/stack/custom_templates/
. Source the authentication file:
[stack@undercloud-0 ~]$ source stackrc
(undercloud) [stack@undercloud-0 ~]$
Include the
stf-connectors.yaml
file in theovercloud deployment
command, along with any other environment files relevant to your environment:

(undercloud) [stack@undercloud-0 ~]$ openstack overcloud deploy \
  --templates /usr/share/openstack-tripleo-heat-templates \
  ...
  -e /home/stack/custom_templates/stf-connectors.yaml \
  ...
Additional resources
For information about validating the deployment, see Section 3.3.3, “Validating client-side installation”.
4.6.5. Querying metrics data from multiple clouds
Data stored in Prometheus has a service label attached according to the Smart Gateway it was scraped from. You can use this label to query data from a specific cloud.
To query data from a specific cloud, use a Prometheus promql
query that matches the associated service label; for example: collectd_uptime{service="default-cloud1-coll-meter-smartgateway"}
.
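As a further sketch, if a second cloud is configured with a Smart Gateway named, for example, default-cloud2-coll-meter-smartgateway (a hypothetical name following the pattern shown above), you can compare the same metric across both clouds with a regular-expression label matcher:

collectd_uptime{service=~"default-cloud[12]-coll-meter-smartgateway"}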
4.7. Ephemeral storage
Use ephemeral storage to run Service Telemetry Framework (STF) without persistently storing data in your Red Hat OpenShift Container Platform (OCP) cluster. Ephemeral storage is not recommended in a production environment due to the volatility of the data in the platform when operating correctly and as designed. For example, restarting a pod or rescheduling the workload to another node results in the loss of any local data written since the pod started.
4.7.1. Configuring ephemeral storage
To configure STF components for ephemeral storage, add ...storage.strategy: ephemeral
to the corresponding parameter. For example, to enable ephemeral storage for the Prometheus backend, set backends.metrics.prometheus.storage.strategy: ephemeral
. Components that support configuration of ephemeral storage include alerting.alertmanager
, backends.metrics.prometheus
, and backends.events.elasticsearch
. You can add ephemeral storage configuration at installation time or, if you already deployed STF, complete the following steps:
Procedure
- Log in to Red Hat OpenShift Container Platform.
Change to the
service-telemetry
namespace:

$ oc project service-telemetry
Edit the ServiceTelemetry object:
$ oc edit stf default
Add the
...storage.strategy: ephemeral
parameter to the spec
section of the relevant component:

apiVersion: infra.watch/v1beta1
kind: ServiceTelemetry
metadata:
  name: stf-default
  namespace: service-telemetry
spec:
  alerting:
    enabled: true
    alertmanager:
      storage:
        strategy: ephemeral
  backends:
    metrics:
      prometheus:
        enabled: true
        storage:
          strategy: ephemeral
    events:
      elasticsearch:
        enabled: true
        storage:
          strategy: ephemeral
- Save your changes and close the object.
4.8. Monitoring the resource usage of Red Hat OpenStack Platform services
Monitor the resource usage of the Red Hat OpenStack Platform services, such as the APIs and other infrastructure processes, to identify bottlenecks in the overcloud by showing which services are running out of compute power. Enable the collectd-libpod-stats
plug-in to gather CPU and memory usage metrics for every container running in the overcloud.
Prerequisites
-
You have created the
stf-connectors.yaml
file. For more information, see Section 3.3, “Configuring Red Hat OpenStack Platform overcloud for Service Telemetry Framework”. - You are using the most current version of Red Hat OpenStack Platform: 16.1.
Procedure
-
Open the
stf-connectors.yaml
file. Add the following configuration to
parameter_defaults
:

CollectdEnableLibpodstats: true
- Continue with the overcloud deployment procedure.
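For illustration, after this change the parameter_defaults section of your stf-connectors.yaml contains the new parameter alongside the existing ones, for example:

parameter_defaults:
  CollectdConnectionType: amqp1
  ...
  CollectdEnableLibpodstats: true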