Chapter 15. Troubleshooting monitoring issues
Find troubleshooting steps for common issues with core platform and user-defined project monitoring.
15.1. Investigating why user-defined project metrics are unavailable
ServiceMonitor resources enable you to determine how to use the metrics exposed by a service in user-defined projects. Follow the steps outlined in this procedure if you have created a ServiceMonitor resource but cannot see any corresponding metrics in the Metrics UI.
Prerequisites
- You have access to the cluster as a user with the cluster-admin cluster role.
- You have installed the OpenShift CLI (oc).
- You have enabled and configured monitoring for user-defined workloads.
- You have created the user-workload-monitoring-config ConfigMap object.
- You have created a ServiceMonitor resource.
Procedure
Check that the corresponding labels match in the service and ServiceMonitor resource configurations.
- Obtain the label defined in the service. The following example queries the prometheus-example-app service in the ns1 project:
$ oc -n ns1 get service prometheus-example-app -o yaml
Example output
labels:
  app: prometheus-example-app
- Check that the matchLabels app label in the ServiceMonitor resource configuration matches the label output in the preceding step:
$ oc -n ns1 get servicemonitor prometheus-example-monitor -o yaml
Example output
spec:
  endpoints:
  - interval: 30s
    port: web
    scheme: http
  selector:
    matchLabels:
      app: prometheus-example-app
Note: You can check service and ServiceMonitor resource labels as a developer with view permissions for the project.
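The label comparison can also be scripted. The following is a minimal sketch using the resource names from this procedure (ns1, prometheus-example-app, prometheus-example-monitor); substitute your own namespace and resource names:

```shell
# Fetch the 'app' label from the service and the matchLabels value from the
# ServiceMonitor, then compare them. The names used here are the examples
# from this procedure, not fixed values.
svc_label=$(oc -n ns1 get service prometheus-example-app \
  -o jsonpath='{.metadata.labels.app}')
mon_label=$(oc -n ns1 get servicemonitor prometheus-example-monitor \
  -o jsonpath='{.spec.selector.matchLabels.app}')
if [ "$svc_label" = "$mon_label" ]; then
  echo "labels match: $svc_label"
else
  echo "label mismatch: service='$svc_label' servicemonitor='$mon_label'"
fi
```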
Inspect the logs for the Prometheus Operator in the openshift-user-workload-monitoring project.
- List the pods in the openshift-user-workload-monitoring project:
$ oc -n openshift-user-workload-monitoring get pods
Example output
NAME                                   READY   STATUS    RESTARTS   AGE
prometheus-operator-776fcbbd56-2nbfm   2/2     Running   0          132m
prometheus-user-workload-0             5/5     Running   1          132m
prometheus-user-workload-1             5/5     Running   1          132m
thanos-ruler-user-workload-0           3/3     Running   0          132m
thanos-ruler-user-workload-1           3/3     Running   0          132m
- Obtain the logs from the prometheus-operator container in the prometheus-operator pod. In the following example, the pod is called prometheus-operator-776fcbbd56-2nbfm:
$ oc -n openshift-user-workload-monitoring logs prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator
If there is an issue with the service monitor, the logs might include an error similar to this example:
level=warn ts=2020-08-10T11:48:20.906739623Z caller=operator.go:1829 component=prometheusoperator msg="skipping servicemonitor" error="it accesses file system via bearer token file which Prometheus specification prohibits" servicemonitor=eagle/eagle namespace=openshift-user-workload-monitoring prometheus=user-workload
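In a long log, you can filter for ServiceMonitor-related entries such as the warning above (a simple grep sketch; the pod name is the one from the example and will differ in your cluster):

```shell
# Filter the operator logs for messages that mention ServiceMonitor
# resources, such as the "skipping servicemonitor" warning shown above.
oc -n openshift-user-workload-monitoring logs \
  prometheus-operator-776fcbbd56-2nbfm -c prometheus-operator \
  | grep -i "servicemonitor"
```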
Review the target status for your endpoint on the Metrics targets page in the OpenShift Container Platform web console.
- Log in to the OpenShift Container Platform web console and navigate to Observe → Targets in the Administrator perspective.
- Locate the metrics endpoint in the list, and review the status of the target in the Status column.
- If the Status is Down, click the URL for the endpoint to view more information on the Target Details page for that metrics target.
Configure debug level logging for the Prometheus Operator in the openshift-user-workload-monitoring project.
- Edit the user-workload-monitoring-config ConfigMap object in the openshift-user-workload-monitoring project:
$ oc -n openshift-user-workload-monitoring edit configmap user-workload-monitoring-config
- Add logLevel: debug for prometheusOperator under data/config.yaml to set the log level to debug:
apiVersion: v1
kind: ConfigMap
metadata:
  name: user-workload-monitoring-config
  namespace: openshift-user-workload-monitoring
data:
  config.yaml: |
    prometheusOperator:
      logLevel: debug
- Save the file to apply the changes.
Note: The prometheus-operator pod in the openshift-user-workload-monitoring project restarts automatically when you apply the log-level change.
- Confirm that the debug log level has been applied to the prometheus-operator deployment in the openshift-user-workload-monitoring project:
$ oc -n openshift-user-workload-monitoring get deploy prometheus-operator -o yaml | grep "log-level"
Example output
- --log-level=debug
Debug-level logging shows all calls made by the Prometheus Operator.
Check that the prometheus-operator pod is running:
$ oc -n openshift-user-workload-monitoring get pods
Note: If an unrecognized Prometheus Operator logLevel value is included in the config map, the prometheus-operator pod might not restart successfully.
- Review the debug logs to see if the Prometheus Operator is using the ServiceMonitor resource. Review the logs for other related errors.
Additional resources
- Creating a user-defined workload monitoring config map
- See Specifying how a service is monitored for details on how to create a ServiceMonitor or PodMonitor resource
- See Getting detailed information about metrics targets
15.2. Determining why Prometheus is consuming a lot of disk space
Developers can create labels to define attributes for metrics in the form of key-value pairs. The number of potential key-value pairs corresponds to the number of possible values for an attribute. An attribute that has an unlimited number of potential values is called an unbound attribute. For example, a customer_id attribute is unbound because it has an infinite number of possible values.
Every assigned key-value pair has a unique time series. The use of many unbound attributes in labels can result in an exponential increase in the number of time series created. This can impact Prometheus performance and can consume a lot of disk space.
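The growth is multiplicative: the series count for a metric is the product of the number of distinct values of each of its labels, so a single unbound label dominates the total. A small illustration (the label names and counts below are invented for the example):

```shell
# Each combination of label values is a separate time series, so the total
# is the product of the per-label value counts. Hypothetical counts:
pods=100         # a 'pod' label with 100 distinct values (bounded)
codes=5          # an 'http_status' label with 5 distinct values (bounded)
customers=10000  # an unbound 'customer_id' label, still growing

echo "bounded labels only: $((pods * codes)) series"
echo "adding customer_id:  $((pods * codes * customers)) series"
```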
You can use the following measures when Prometheus consumes a lot of disk space:
- Check the number of scrape samples that are being collected.
- Check the time series database (TSDB) status using the Prometheus HTTP API for more information about which labels are creating the most time series. Doing so requires cluster administrator privileges.
- Reduce the number of unique time series that are created by reducing the number of unbound attributes that are assigned to user-defined metrics.
Note: Using attributes that are bound to a limited set of possible values reduces the number of potential key-value pair combinations.
- Enforce limits on the number of samples that can be scraped across user-defined projects. This requires cluster administrator privileges.
Prerequisites
- You have access to the cluster as a user with the cluster-admin cluster role.
- You have installed the OpenShift CLI (oc).
Procedure
- In the Administrator perspective, navigate to Observe → Metrics.
Run the following Prometheus Query Language (PromQL) query in the Expression field. This returns the ten metrics that have the highest number of scrape samples:
topk(10, count by (job)({__name__=~".+"}))
Investigate the number of unbound label values assigned to metrics with higher than expected scrape sample counts:
- If the metrics relate to a user-defined project, review the metrics key-value pairs assigned to your workload. These are implemented through Prometheus client libraries at the application level. Try to limit the number of unbound attributes referenced in your labels.
- If the metrics relate to a core OpenShift Container Platform project, create a Red Hat support case on the Red Hat Customer Portal.
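If a metric stands out, standard PromQL run in the same Expression field can help identify which of its labels is unbound. For example (example_metric and customer_id are placeholders; substitute your own metric and label names):

```promql
# Number of series for one metric:
count(example_metric)

# Number of distinct values of a suspected unbound label on that metric:
count(count by (customer_id) (example_metric))
```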
Review the TSDB status using the Prometheus HTTP API by running the following commands as a cluster administrator:
$ oc login -u <username> -p <password>
$ host=$(oc -n openshift-monitoring get route prometheus-k8s -ojsonpath={.spec.host})
$ token=$(oc whoami -t)
$ curl -H "Authorization: Bearer $token" -k "https://$host/api/v1/status/tsdb"
Example output
"status": "success",
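The full response is verbose; its data section includes fields such as seriesCountByMetricName and labelValueCountByLabelName from the Prometheus TSDB stats API. A sketch that extracts the highest-cardinality labels, assuming jq is installed and that $host and $token are set as in the preceding commands:

```shell
# Print each label name with its number of distinct values, taken from the
# labelValueCountByLabelName section of the TSDB status response.
curl -s -H "Authorization: Bearer $token" -k \
  "https://$host/api/v1/status/tsdb" \
  | jq -r '.data.labelValueCountByLabelName[] | "\(.value)\t\(.name)"'
```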
Additional resources
- See Setting a scrape sample limit for user-defined projects for details on how to set a scrape sample limit and create related alerting rules
- Submitting a support case