[OpenShift 3] Slow and failing OpenShift metrics
Issue
On a growing developer cluster (around 300 namespaces and around 100 Jenkins instances running development pipelines for projects on the cluster) with an almost default installation of metrics (Cassandra, Hawkular), we are experiencing slow metrics (graphs in the OpenShift web console can take up to a minute to load) and periodic crashes of the entire metrics stack, roughly every few weeks.
We can also see BusyPoolException messages in the hawkular-metrics pod logs:
2020-12-03T16:07:18.059306593Z 2020-12-03 16:07:18,034 ERROR [org.hawkular.metrics.api.jaxrs.util.ApiUtils] (RxComputationScheduler-1) HAWKMETRICS200010: Failed to process request: java.util.concurrent.ExecutionException: com.datastax.driver.core.exceptions.NoHostAvailableException: All host(s) tried for query failed (tried: hawkular-cassandra/192.168.87.231:9042 (com.datastax.driver.core.exceptions.BusyPoolException: [hawkular-cassandra/192.168.87.231] Pool is busy (no available connection and the queue has reached its max size 256)))
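To gauge how often the connection pool is exhausted, the log can be scanned for this message. The following is a minimal sketch, not part of the original report: the function name count_busy_pool and the regular expression are illustrative assumptions, based only on the message format shown in the log line above.

```python
import re
from collections import Counter

# Pattern derived from the BusyPoolException text in the log line above.
# It captures the Cassandra host and the reported maximum queue size.
BUSY_POOL = re.compile(
    r"BusyPoolException: \[(?P<host>[^\]]+)\] Pool is busy "
    r"\(no available connection and the queue has reached "
    r"its max size (?P<size>\d+)\)"
)


def count_busy_pool(lines):
    """Count BusyPoolException occurrences per (host, max queue size)."""
    hits = Counter()
    for line in lines:
        for match in BUSY_POOL.finditer(line):
            key = (match.group("host"), int(match.group("size")))
            hits[key] += 1
    return hits
```

For example, after saving the pod log to a file (the file name hawkular-metrics.log is hypothetical), count_busy_pool(open("hawkular-metrics.log")) returns a Counter keyed by host and queue size, which makes it easier to see whether the errors are occasional or sustained.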
Environment
- Red Hat OpenShift Container Platform (OCP) 3.11