HTTP status 500 Internal Server Error for Prometheus endpoints

Environment

OpenShift Container Platform
- 4.1.0
- 4.1.4
- 4.1.7

Issue

Prometheus reports the error "server returned HTTP status 500 Internal Server Error".
The error occurs on one or more nodes when accessing monitoring targets or endpoints that are normally accessible.
The journal (journalctl) on an affected node shows errors related to metrics collection:

10 error(s) occurred:
* collected metric kubelet_container_log_filesystem_used_bytes label:<name:"container" value:"dns" > label:<name:"namespace" value:"openshift-dns" > label:<name:"pod" value:"dns-default-dpvdr" > gauge:<value:12288 >  was collected before with the same name and label values
* collected metric kubelet_container_log_filesystem_used_bytes label:<name:"container" value:"machine-config-daemon" > label:<name:"namespace" value:"openshift-machine-config-operator" > label:<name:"pod" value:"machine-config-daemon-7clnj" > gauge:<value:12288 >  was collected before with the same name and label values
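When many of these errors appear, it can help to summarize which pods are involved. The following is a minimal sketch, assuming the journal output has been saved as text and that the lines follow the format shown above. The function name, the regular expressions, and the SAMPLE constant are illustrative and are not part of any Red Hat or Prometheus tooling.

```python
import re

# Matches the metric name and the label/value dump in a duplicate-metric error line.
METRIC_RE = re.compile(r"collected metric (\S+) (.*?)\s+was collected before")
# Matches one label pair as printed in the journal, e.g. label:<name:"pod" value:"x" >
LABEL_RE = re.compile(r'label:<name:"([^"]*)" value:"([^"]*)" >')

# Sample input copied from the journal excerpt in this article.
SAMPLE = '''10 error(s) occurred:
* collected metric kubelet_container_log_filesystem_used_bytes label:<name:"container" value:"dns" > label:<name:"namespace" value:"openshift-dns" > label:<name:"pod" value:"dns-default-dpvdr" > gauge:<value:12288 >  was collected before with the same name and label values
* collected metric kubelet_container_log_filesystem_used_bytes label:<name:"container" value:"machine-config-daemon" > label:<name:"namespace" value:"openshift-machine-config-operator" > label:<name:"pod" value:"machine-config-daemon-7clnj" > gauge:<value:12288 >  was collected before with the same name and label values
'''


def parse_duplicate_metric_errors(journal_text):
    """Return a list of (metric_name, labels_dict) for each duplicate-metric error line."""
    results = []
    for line in journal_text.splitlines():
        match = METRIC_RE.search(line)
        if match is None:
            continue  # not a duplicate-metric error line
        labels = dict(LABEL_RE.findall(match.group(2)))
        results.append((match.group(1), labels))
    return results


if __name__ == "__main__":
    for metric, labels in parse_duplicate_metric_errors(SAMPLE):
        print(metric, labels.get("namespace"), labels.get("pod"), labels.get("container"))
```

The output lists one line per duplicated series, which makes it easier to see whether the duplicates are confined to a few pods or spread across the node.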

Resolution

Upgrade to OpenShift Container Platform 4.1.2 or later, where the issue reported against 4.1.0 is marked as fixed, before attempting further troubleshooting. A very similar issue observed in 4.1.4 and 4.1.7 was fixed in 4.1.9, so clusters on those versions should upgrade to 4.1.9 or later.
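As a rough illustration of the version guidance above, the sketch below compares a cluster version string against 4.1.9, the latest fix release named in this article. The function names are hypothetical, and the check assumes plain numeric x.y.z version strings. It is a convenience for reasoning about the version numbers, not a supported tool.

```python
def parse_version(version):
    """Convert a plain 'x.y.z' version string into a tuple of integers."""
    return tuple(int(part) for part in version.split("."))


# Latest fix release mentioned in this article (assumption: treat it as the safe target).
FIXED_IN = parse_version("4.1.9")


def needs_upgrade(version):
    """Return True if the given version is older than the fix release."""
    return parse_version(version) < FIXED_IN


if __name__ == "__main__":
    for candidate in ("4.1.0", "4.1.4", "4.1.7", "4.1.9", "4.1.10"):
        print(candidate, "needs upgrade:", needs_upgrade(candidate))
```

Comparing integer tuples rather than raw strings avoids ordering mistakes such as treating 4.1.10 as older than 4.1.9.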

Root Cause

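The journal errors indicate that the same metric was collected more than once with identical name and label values. A bug report was filed against 4.1.0 for Prometheus endpoints returning 500 errors. The bug report links to the backports and to the related upstream reports.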

Diagnostic Steps

Deploy OCP 4.1.0 in a VMware or bare-metal environment with monitoring enabled.
Let the cluster run for a few days.
Load the Prometheus page, and observe the error.
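Instead of inspecting the Prometheus page by eye, the failing targets can be picked out of the JSON returned by the Prometheus HTTP API endpoint /api/v1/targets. The sketch below assumes that response has already been fetched and parsed into a dictionary. How to reach and authenticate against Prometheus on a given cluster is not covered here. The function name and the sample data are hypothetical.

```python
def failing_targets(targets_response, needle="500 Internal Server Error"):
    """Return active targets that are down and whose last error contains `needle`."""
    active = targets_response.get("data", {}).get("activeTargets", [])
    return [
        {
            "job": target.get("labels", {}).get("job"),
            "scrapeUrl": target.get("scrapeUrl"),
            "lastError": target.get("lastError"),
        }
        for target in active
        if target.get("health") == "down" and needle in target.get("lastError", "")
    ]


# Hypothetical sample shaped like a /api/v1/targets response, for illustration only.
SAMPLE_RESPONSE = {
    "status": "success",
    "data": {
        "activeTargets": [
            {
                "labels": {"job": "kubelet", "instance": "node-a"},
                "scrapeUrl": "https://node-a.example/metrics",
                "lastError": "server returned HTTP status 500 Internal Server Error",
                "health": "down",
            },
            {
                "labels": {"job": "kubelet", "instance": "node-b"},
                "scrapeUrl": "https://node-b.example/metrics",
                "lastError": "",
                "health": "up",
            },
        ]
    },
}

if __name__ == "__main__":
    for target in failing_targets(SAMPLE_RESPONSE):
        print(target["job"], target["scrapeUrl"], target["lastError"])
```

Filtering on both the health field and the error text keeps targets that are down for unrelated reasons, such as timeouts, out of the result.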

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
