Prometheus pods unable to see targets in OCP 3
Environment
- Red Hat OpenShift Container Platform (RHOCP) 3.11
- Red Hat OpenShift Container Storage (RHOCS) 3.11
Issue
- Metrics are not seen in Grafana.
- The node-exporter pods seem to be working fine, but targets are seen down from the Prometheus pods.
- Querying the metrics targets from the Prometheus pods produces the error WAL log samples: log series: write /prometheus/wal/xxyyzz: transport endpoint is not connected, where xxyyzz is the WAL number.
Resolution
Delete the prometheus pod that is unable to read the metric targets.
- Identify the prometheus pods and the nodes on which they are running:
$ oc get pods -l app=prometheus -n openshift-monitoring -o wide
NAME               READY   STATUS    RESTARTS   AGE   IP           NODE                  NOMINATED NODE
prometheus-k8s-0   4/4     Running   1          1h    10.131.0.3   infra-0.example.com   <none>
prometheus-k8s-1   4/4     Running   1          1h    10.130.0.5   infra-1.example.com   <none>
- Delete the prometheus pod that is unable to read the metric targets. If, for example, pod prometheus-k8s-1 is unable to read the metric targets and produces the error (as described in the simulation shown in the Diagnostic Steps section):
$ oc delete pod prometheus-k8s-1 -n openshift-monitoring
- The pod will be re-created by the prometheus-operator (an optional verification check is sketched after these steps):
$ oc get pods -l app=prometheus -n openshift-monitoring -o wide
NAME               READY   STATUS    RESTARTS   AGE   IP           NODE                  NOMINATED NODE
prometheus-k8s-0   4/4     Running   1          1h    10.131.0.3   infra-0.example.com   <none>
prometheus-k8s-1   4/4     Running   1          30s   10.130.0.6   infra-1.example.com   <none>
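Once the new pod is running, the targets can be checked from it again to confirm they are now reported as up. This is an optional verification step that simply reuses the targets query shown in the Diagnostic Steps section below:
$ oc exec prometheus-k8s-1 -c prometheus -n openshift-monitoring -- curl -s http://localhost:9090/api/v1/targets | \
  jq -r '.data.activeTargets[] | .scrapeUrl+" "+.health'
All targets should now report up; if they do not, continue with the Diagnostic Steps below.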
Root Cause
Prometheus pods are typically scheduled on two different infra nodes in the OCP cluster. If one of those pods is able to see the metrics targets (i.e. node-exporter is properly working and reachable), but the other pod reports the targets as down, then the problem is likely to be in the latter prometheus pod.
In the simulation shown in the Diagnostic Steps section of this article, the error message indicates that the pod (and by extension the node on which it was scheduled) was unable to write to the storage volume. The message transport endpoint is not connected hints that the gluster volume is not properly mounted. In this simulation the condition originated from a gluster node being rebooted, and that node happened to be the same one referenced by the volume mounted in the pod prometheus-k8s-1.
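One way to confirm this kind of mount problem is to inspect the Gluster mount directly on the node hosting the affected pod. The commands below are only a sketch; the exact mount point depends on the PV/PVC in use and is shown here as a placeholder:
# On the node running the affected pod (in this simulation, infra-1.example.com)
$ mount | grep -i gluster
# Accessing a stale mount point typically fails with the same
# "Transport endpoint is not connected" error:
$ ls <gluster-mount-point>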
Diagnostic Steps
- Check that all nodes have a node-exporter pod running and ready. For example:
$ oc get pods -l app=node-exporter -n openshift-monitoring -o wide
NAME                  READY   STATUS    RESTARTS   AGE   IP            NODE                   NOMINATED NODE
node-exporter-kznvl   2/2     Running   0          1h    10.0.91.11    master-0.example.com   <none>
node-exporter-cxvlf   2/2     Running   0          1h    10.0.88.69    infra-0.example.com    <none>
node-exporter-gr6bd   2/2     Running   0          1h    10.0.89.16    infra-1.example.com    <none>
node-exporter-66bh4   2/2     Running   0          1h    10.0.90.49    infra-2.example.com    <none>
node-exporter-qgscg   2/2     Running   0          1h    10.0.94.251   node-0.example.com     <none>
node-exporter-h6fnk   2/2     Running   0          1h    10.0.94.23    node-1.example.com     <none>
node-exporter-m2fvq   2/2     Running   0          1h    10.0.90.211   node-2.example.com     <none>
- From the bastion host or from the master node, check the targets as seen from the Prometheus pods. Export the output in json format to files:
$ oc exec prometheus-k8s-0 -c prometheus -n openshift-monitoring -- curl http://localhost:9090/api/v1/targets > prometheus-k8s-0_targets.out
$ oc exec prometheus-k8s-1 -c prometheus -n openshift-monitoring -- curl http://localhost:9090/api/v1/targets > prometheus-k8s-1_targets.out
- From a node where the jq tool is installed, parse the output files to check the status of the targets:
$ cat prometheus-k8s-0_targets.out | jq -r '.data.activeTargets[] | .scrapeUrl+" "+.health+" "+.lastError'
https://10.131.0.3:9091/metrics up
https://10.130.0.5:9091/metrics up
https://10.130.0.11:8443/metrics up
https://10.0.89.16:10250/metrics/cadvisor up
https://10.0.90.49:10250/metrics/cadvisor up
https://10.0.91.11:10250/metrics/cadvisor up
https://10.0.94.251:10250/metrics/cadvisor up
https://10.0.94.23:10250/metrics/cadvisor up
https://10.0.90.211:10250/metrics/cadvisor up
https://10.0.88.69:10250/metrics/cadvisor up
https://10.0.90.211:10250/metrics up
https://10.0.88.69:10250/metrics up
https://10.0.89.16:10250/metrics up
https://10.0.90.49:10250/metrics up
https://10.0.91.11:10250/metrics up
https://10.0.94.251:10250/metrics up
https://10.0.94.23:10250/metrics up
https://10.130.0.11:9443/metrics up
http://10.130.0.4:8080/metrics up
https://10.0.91.11:8444/metrics up
https://10.0.91.11:443/metrics up
http://10.131.0.10:8080/metrics up
https://10.0.91.11:9100/metrics up
https://10.0.94.23:9100/metrics up
https://10.0.94.251:9100/metrics up
https://10.0.88.69:9100/metrics up
https://10.0.89.16:9100/metrics up
https://10.0.90.211:9100/metrics up
https://10.0.90.49:9100/metrics up
https://10.131.0.11:9094/metrics up
https://10.130.0.6:9094/metrics up
https://10.131.0.4:9094/metrics up
The output above is what is expected. One can see targets up for every node (ports 10250, 9100), the master (ports 8444, 443), the prometheus-k8s-0 and prometheus-k8s-1 pods (port 9091), the kube-state-metrics pod (ports 8443, 9443), the prometheus-operator pod (port 8080), the cluster-monitoring-operator pod (port 8080), and the alertmanager-main-{0,1,2} pods (port 9094).
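To spot unhealthy targets at a glance instead of reading the full listing, the same jq filter can be restricted with a select expression. This is a minimal sketch against the files generated in the previous step:
$ jq -r '.data.activeTargets[] | select(.health != "up") | .scrapeUrl+" "+.health+" "+.lastError' prometheus-k8s-0_targets.out
For a healthy pod this prints nothing.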
By parsing the output file corresponding to the other prometheus pod, we may notice that targets are seen down with an error message:
$ cat prometheus-k8s-1_targets.out | jq -r '.data.activeTargets[] |.scrapeUrl+" "+.health+" "+.lastError'
https://10.131.0.3:9091/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.130.0.5:9091/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.130.0.11:8443/metrics down log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.89.16:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.49:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.251:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.23:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.211:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.88.69:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.211:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.88.69:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.89.16:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.49:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.251:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.23:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.130.0.11:9443/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
http://10.130.0.4:8080/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:8444/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:443/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
http://10.131.0.10:8080/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.23:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.251:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.88.69:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.89.16:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.211:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.49:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.131.0.11:9094/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.130.0.6:9094/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.131.0.4:9094/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
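To summarize the health of the targets for both pods in one pass, a small loop over the same targets API can be used. This is a sketch that reuses the oc exec, curl and jq commands shown above:
$ for pod in prometheus-k8s-0 prometheus-k8s-1; do
    echo "== ${pod} =="
    oc exec ${pod} -c prometheus -n openshift-monitoring -- curl -s http://localhost:9090/api/v1/targets | \
      jq -r '[.data.activeTargets[].health] | group_by(.) | map("\(.[0]): \(length)") | .[]'
  done
A pod whose backing storage is broken will show all of its targets as down, as in the output above.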
In this case the pod needs to be restarted as described in the Resolution section, because the error message indicates that the backing storage provided by OpenShift Container Storage or Gluster is not properly mounted. See the article Transport endpoint not connected for more details.
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.