Prometheus pods unable to see targets in OCP 3

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 3.11
  • Red Hat OpenShift Container Storage (RHOCS)
    • 3.11

Issue

  • Metrics are not displayed in Grafana
  • The node-exporter pods appear to be working fine, but targets are reported as down by a Prometheus pod
  • Querying the metric targets from the affected Prometheus pod produces the error WAL log samples: log series: write /prometheus/wal/xxyyzz: transport endpoint is not connected, where xxyyzz is the WAL segment number

Resolution

Delete the Prometheus pod that is unable to see its metric targets.

  1. Identify the prometheus pods and the nodes on which they are running:

    $ oc get pods -l app=prometheus -n openshift-monitoring -o wide
    NAME               READY     STATUS    RESTARTS   AGE       IP           NODE                  NOMINATED NODE
    prometheus-k8s-0   4/4       Running   1          1h        10.131.0.3   infra-0.example.com   <none>
    prometheus-k8s-1   4/4       Running   1          1h        10.130.0.5   infra-1.example.com   <none>
    
  2. Delete the Prometheus pod that is unable to see its metric targets. For example, if the pod prometheus-k8s-1 is the one producing the error (as in the simulation shown in the Diagnostic Steps section):

    $ oc delete pod prometheus-k8s-1 -n openshift-monitoring
    
  3. The pod is re-created automatically by the prometheus-operator (a verification sketch follows this list):

    $ oc get pods -l app=prometheus -n openshift-monitoring -o wide
    NAME               READY     STATUS    RESTARTS   AGE       IP           NODE                  NOMINATED NODE
    prometheus-k8s-0   4/4       Running   1          1h        10.131.0.3   infra-0.example.com   <none>
    prometheus-k8s-1   4/4       Running   1          30s       10.130.0.6   infra-1.example.com   <none>
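
Once the new pod is Running, the same targets query used in the Diagnostic Steps section can be repeated against it as a quick verification. The one-liner below is a convenience sketch (it assumes jq is installed on the host where oc runs) that prints only targets that are not up; an empty result means the new pod sees all of its targets:

$ oc exec prometheus-k8s-1 -c prometheus -n openshift-monitoring -- curl -s http://localhost:9090/api/v1/targets | jq -r '.data.activeTargets[] | select(.health != "up") | .scrapeUrl+" "+.lastError'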
    

Root Cause

Prometheus pods are typically scheduled on two different infra nodes in the OCP cluster. If one of those pods can see the metrics targets (that is, node-exporter is working properly and is reachable) but the other pod reports the same targets as down, the problem most likely lies in the latter Prometheus pod rather than in the targets themselves.
In the simulation shown in the Diagnostic Steps section of this article, the error message indicates that the pod (and, by extension, the node on which it was scheduled) was unable to write to its storage volume. The message transport endpoint is not connected hints that the Gluster volume is not properly mounted. In this simulation the condition originated from a Gluster node being rebooted, and that node happened to be the one backing the volume mounted by the pod prometheus-k8s-1.
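
One way to confirm this condition is to inspect the GlusterFS mounts directly on the node hosting the affected pod (infra-1.example.com in this simulation). This is a generic check rather than output captured from the incident; a stale mount typically reports the same transport endpoint is not connected error:

$ mount | grep -i glusterfs
$ df -h -t fuse.glusterfs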

Diagnostic Steps

  1. Check that all nodes have a node-exporter pod running and ready. For example:

    $ oc get pods -l app=node-exporter -n openshift-monitoring -o wide
    NAME                  READY     STATUS    RESTARTS   AGE       IP            NODE                   NOMINATED NODE
    node-exporter-kznvl   2/2       Running   0          1h        10.0.91.11    master-0.example.com   <none>
    node-exporter-cxvlf   2/2       Running   0          1h        10.0.88.69    infra-0.example.com    <none>
    node-exporter-gr6bd   2/2       Running   0          1h        10.0.89.16    infra-1.example.com    <none>
    node-exporter-66bh4   2/2       Running   0          1h        10.0.90.49    infra-2.example.com    <none>
    node-exporter-qgscg   2/2       Running   0          1h        10.0.94.251   node-0.example.com     <none>
    node-exporter-h6fnk   2/2       Running   0          1h        10.0.94.23    node-1.example.com     <none>
    node-exporter-m2fvq   2/2       Running   0          1h        10.0.90.211   node-2.example.com     <none>
    
  2. From the bastion host or a master node, check the targets as seen by each Prometheus pod, exporting the JSON output to files:

    $ oc exec prometheus-k8s-0 -c prometheus -n openshift-monitoring -- curl http://localhost:9090/api/v1/targets > prometheus-k8s-0_targets.out
    $ oc exec prometheus-k8s-1 -c prometheus -n openshift-monitoring -- curl http://localhost:9090/api/v1/targets > prometheus-k8s-1_targets.out
    
  3. From a node where the jq tool is installed, parse the output files to check the status of the targets.

    $ cat prometheus-k8s-0_targets.out | jq -r '.data.activeTargets[] |.scrapeUrl+" "+.health+" "+.lastError'
    
    https://10.131.0.3:9091/metrics up 
    https://10.130.0.5:9091/metrics up 
    https://10.130.0.11:8443/metrics up 
    https://10.0.89.16:10250/metrics/cadvisor up 
    https://10.0.90.49:10250/metrics/cadvisor up 
    https://10.0.91.11:10250/metrics/cadvisor up 
    https://10.0.94.251:10250/metrics/cadvisor up 
    https://10.0.94.23:10250/metrics/cadvisor up 
    https://10.0.90.211:10250/metrics/cadvisor up 
    https://10.0.88.69:10250/metrics/cadvisor up 
    https://10.0.90.211:10250/metrics up 
    https://10.0.88.69:10250/metrics up 
    https://10.0.89.16:10250/metrics up 
    https://10.0.90.49:10250/metrics up 
    https://10.0.91.11:10250/metrics up 
    https://10.0.94.251:10250/metrics up 
    https://10.0.94.23:10250/metrics up 
    https://10.130.0.11:9443/metrics up 
    http://10.130.0.4:8080/metrics up 
    https://10.0.91.11:8444/metrics up 
    https://10.0.91.11:443/metrics up 
    http://10.131.0.10:8080/metrics up 
    https://10.0.91.11:9100/metrics up 
    https://10.0.94.23:9100/metrics up 
    https://10.0.94.251:9100/metrics up 
    https://10.0.88.69:9100/metrics up 
    https://10.0.89.16:9100/metrics up 
    https://10.0.90.211:9100/metrics up 
    https://10.0.90.49:9100/metrics up 
    https://10.131.0.11:9094/metrics up 
    https://10.130.0.6:9094/metrics up 
    https://10.131.0.4:9094/metrics up
    
    

The output above is what is expected: targets are up for every node (ports 10250 and 9100), the master (ports 8444 and 443), the prometheus-k8s-0 and prometheus-k8s-1 pods (port 9091), the kube-state-metrics pod (ports 8443 and 9443), the prometheus-operator pod (port 8080), the cluster-monitoring-operator pod (port 8080), and the alertmanager-main-{0,1,2} pods (port 9094).
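
For a quick summary instead of reading through the whole list, the health values can be counted per state. This is a small convenience one-liner that is not part of the original procedure and again assumes jq is available locally; a healthy pod reports only up entries:

$ jq -r '.data.activeTargets[].health' prometheus-k8s-0_targets.out | sort | uniq -c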

Parsing the output file of the other Prometheus pod shows that the targets are reported as down, with an error message:

$ cat prometheus-k8s-1_targets.out | jq -r '.data.activeTargets[] |.scrapeUrl+" "+.health+" "+.lastError'

https://10.131.0.3:9091/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.130.0.5:9091/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.130.0.11:8443/metrics down log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.89.16:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.49:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.251:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.23:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.211:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.88.69:10250/metrics/cadvisor down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.211:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.88.69:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.89.16:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.49:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.251:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.23:10250/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.130.0.11:9443/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
http://10.130.0.4:8080/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:8444/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:443/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
http://10.131.0.10:8080/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.91.11:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.23:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.94.251:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.88.69:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.89.16:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.211:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.0.90.49:9100/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.131.0.11:9094/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.130.0.6:9094/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected
https://10.131.0.4:9094/metrics down WAL log samples: log series: write /prometheus/wal/025642: transport endpoint is not connected

In this case the pod needs to be restarted as described in the Resolution section, because the error message indicates that the backing storage provided by OpenShift Container Storage or Gluster is not properly mounted. See the article Transport endpoint not connected for more details.
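
Before deleting the pod, the same condition can usually also be confirmed in the log of the prometheus container itself. This is a generic check, and the exact log lines may differ from the simulation above:

$ oc logs prometheus-k8s-1 -c prometheus -n openshift-monitoring | grep -i "transport endpoint"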

