OpenShift Container Platform 4 - Cluster Logging unstable and Elasticsearch becomes unavailable frequently
Issue
- In Cluster Logging with 6 or more elasticsearch cluster members, we suddenly noticed that one elasticsearch pod/container was reporting a not ready state. On closer inspection, that pod was heavily loaded in terms of CPU (a 1-minute load average of 44, while the other pods were at around 4). Its logs showed the following:

2021-10-28T06:31:22.034446019Z [2021-10-28T06:31:22,034][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336651] overhead, spent [282ms] collecting in the last [1s]
2021-10-28T06:31:53.958301553Z [2021-10-28T06:31:53,933][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336682] overhead, spent [263ms] collecting in the last [1s]
2021-10-28T06:33:56.385598998Z [2021-10-28T06:33:56,377][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336801] overhead, spent [265ms] collecting in the last [1s]
2021-10-28T06:35:13.681730622Z [2021-10-28T06:35:13,681][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336876] overhead, spent [516ms] collecting in the last [1s]
2021-10-28T06:35:38.848591269Z [2021-10-28T06:35:38,829][WARN ][r.suppressed ] [elasticsearch-cd-a76lica6-2] path: /_prometheus/metrics, params: {}
2021-10-28T06:35:38.848591269Z org.elasticsearch.ElasticsearchException: Search request failed
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.TransportNodePrometheusMetricsAction$AsyncAction$1.onFailure(TransportNodePrometheusMetricsAction.java:160) ~[?:?]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:238) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:296) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:91) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.AbstractRunnable.onRejection(AbstractRunnable.java:63) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.TimedRunnable.onRejection(TimedRunnable.java:50) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onRejection(ThreadContext.java:741) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:104) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.execute(AbstractSearchAsyncAction.java:311) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.FetchSearchPhase.run(FetchSearchPhase.java:80) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:165) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:159) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase.successfulShardExecution(InitialSearchPhase.java:254) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase.onShardResult(InitialSearchPhase.java:242) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$innerOnResponse$0(InitialSearchPhase.java:215) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:187) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
2021-10-28T06:35:38.848591269Z at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
2021-10-28T06:35:38.848591269Z at java.lang.Thread.run(Thread.java:829) ~[?:?]
2021-10-28T06:35:38.848591269Z Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException:
2021-10-28T06:35:38.848591269Z ... 23 more
2021-10-28T06:35:38.848591269Z Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@5ab14b9 on QueueResizingEsThreadPoolExecutor[name = elasticsearch-cd-a76lica6-2/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.8s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@e87b0fa[Running, pool size = 19, active threads = 19, queued tasks = 1124, completed tasks = 23997624]]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
2021-10-28T06:35:38.848591269Z at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
2021-10-28T06:35:38.848591269Z at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
2021-10-28T06:35:38.848591269Z ... 17 more
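To confirm which data node is overloaded and whether its search thread pool is the one rejecting tasks, the per-node load and thread pool statistics can be queried from inside one of the elasticsearch pods. The commands below are a minimal sketch: they assume the default openshift-logging namespace and the component=elasticsearch pod label used by OpenShift Logging, and <elasticsearch_pod_name> is a placeholder for any running elasticsearch pod in the cluster.

# List the elasticsearch pods and their current CPU/memory consumption
$ oc get pods -n openshift-logging -l component=elasticsearch
$ oc adm top pods -n openshift-logging -l component=elasticsearch

# Per-node load average and heap usage as reported by Elasticsearch itself
$ oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- \
    es_util --query="_cat/nodes?v&h=name,load_1m,load_5m,heap.percent,cpu"

# Search thread pool saturation: active threads, queued and rejected tasks per node
$ oc exec -n openshift-logging -c elasticsearch <elasticsearch_pod_name> -- \
    es_util --query="_cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed"

A node whose search queue sits at its capacity (1000 in the exception above) with a growing rejected counter matches the EsRejectedExecutionException in the log, and its load_1m value should stand out from the other cluster members in the same way the 44 vs. 4 load average did here.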
Environment
- Red Hat OpenShift Container Platform 4
- Red Hat OpenShift Logging 5.2