OpenShift Container Platform 4 - Cluster Logging unstable and Elasticsearch becomes unavailable frequently

Solution In Progress

Issue

  • In Cluster Logging with six or more Elasticsearch cluster members, we suddenly noticed that one Elasticsearch pod/container was reporting a not-ready state. When checking the details, we found that it was heavily loaded in terms of CPU (a 1-minute load average of 44, while the other pods were at a 1-minute load average of 4). When checking the logs, we found the following (one way to inspect the thread-pool pressure behind these errors is sketched after the log excerpt):

    2021-10-28T06:31:22.034446019Z [2021-10-28T06:31:22,034][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336651] overhead, spent [282ms] collecting in the last [1s]
    2021-10-28T06:31:53.958301553Z [2021-10-28T06:31:53,933][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336682] overhead, spent [263ms] collecting in the last [1s]
    2021-10-28T06:33:56.385598998Z [2021-10-28T06:33:56,377][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336801] overhead, spent [265ms] collecting in the last [1s]
    2021-10-28T06:35:13.681730622Z [2021-10-28T06:35:13,681][INFO ][o.e.m.j.JvmGcMonitorService] [gc][1336876] overhead, spent [516ms] collecting in the last [1s]
    2021-10-28T06:35:38.848591269Z [2021-10-28T06:35:38,829][WARN ][r.suppressed             ] [elasticsearch-cd-a76lica6-2] path: /_prometheus/metrics, params: {}
    2021-10-28T06:35:38.848591269Z org.elasticsearch.ElasticsearchException: Search request failed
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.TransportNodePrometheusMetricsAction$AsyncAction$1.onFailure(TransportNodePrometheusMetricsAction.java:160) ~[?:?]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.support.TransportAction$1.onFailure(TransportAction.java:91) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.AbstractSearchAsyncAction.raisePhaseFailure(AbstractSearchAsyncAction.java:238) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseFailure(AbstractSearchAsyncAction.java:296) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.FetchSearchPhase$1.onFailure(FetchSearchPhase.java:91) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.AbstractRunnable.onRejection(AbstractRunnable.java:63) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.TimedRunnable.onRejection(TimedRunnable.java:50) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.onRejection(ThreadContext.java:741) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:104) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.AbstractSearchAsyncAction.execute(AbstractSearchAsyncAction.java:311) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.FetchSearchPhase.run(FetchSearchPhase.java:80) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:165) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:159) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.AbstractSearchAsyncAction.onPhaseDone(AbstractSearchAsyncAction.java:259) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.InitialSearchPhase.successfulShardExecution(InitialSearchPhase.java:254) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.InitialSearchPhase.onShardResult(InitialSearchPhase.java:242) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.InitialSearchPhase.access$200(InitialSearchPhase.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.InitialSearchPhase$2.lambda$innerOnResponse$0(InitialSearchPhase.java:215) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.action.search.InitialSearchPhase$1.doRun(InitialSearchPhase.java:187) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:41) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:751) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) ~[?:?]
    2021-10-28T06:35:38.848591269Z  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) ~[?:?]
    2021-10-28T06:35:38.848591269Z  at java.lang.Thread.run(Thread.java:829) ~[?:?]
    2021-10-28T06:35:38.848591269Z Caused by: org.elasticsearch.action.search.SearchPhaseExecutionException: 
    2021-10-28T06:35:38.848591269Z  ... 23 more
    2021-10-28T06:35:38.848591269Z Caused by: org.elasticsearch.common.util.concurrent.EsRejectedExecutionException: rejected execution of org.elasticsearch.common.util.concurrent.TimedRunnable@5ab14b9 on QueueResizingEsThreadPoolExecutor[name = elasticsearch-cd-a76lica6-2/search, queue capacity = 1000, min queue capacity = 1000, max queue capacity = 1000, frame size = 2000, targeted response rate = 1s, task execution EWMA = 5.8s, adjustment amount = 50, org.elasticsearch.common.util.concurrent.QueueResizingEsThreadPoolExecutor@e87b0fa[Running, pool size = 19, active threads = 19, queued tasks = 1124, completed tasks = 23997624]]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.EsAbortPolicy.rejectedExecution(EsAbortPolicy.java:48) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  at java.util.concurrent.ThreadPoolExecutor.reject(ThreadPoolExecutor.java:825) ~[?:?]
    2021-10-28T06:35:38.848591269Z  at java.util.concurrent.ThreadPoolExecutor.execute(ThreadPoolExecutor.java:1355) ~[?:?]
    2021-10-28T06:35:38.848591269Z  at org.elasticsearch.common.util.concurrent.EsThreadPoolExecutor.execute(EsThreadPoolExecutor.java:98) ~[elasticsearch-6.8.1.jar:6.8.1.redhat-00007]
    2021-10-28T06:35:38.848591269Z  ... 17 more
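
The stack trace above shows the node's search thread pool rejecting work (queue capacity 1000, 19 active threads, 1124 queued tasks). As a rough sketch of how to confirm this kind of pressure on a running cluster, the commands below query the per-node search thread-pool statistics and load averages. The pod name is taken from the log excerpt; the `openshift-logging` namespace, the `component=elasticsearch` label, and the `es_util` query helper in the elasticsearch container are assumptions based on a default OpenShift Logging deployment:

    # List the Elasticsearch pods managed by cluster logging
    # (namespace and label assumed from a default OpenShift Logging install)
    oc -n openshift-logging get pods -l component=elasticsearch

    # Show per-node search thread-pool activity, queue depth, and rejections
    # (pod name taken from the log excerpt above)
    oc -n openshift-logging exec -c elasticsearch elasticsearch-cd-a76lica6-2 -- \
      es_util --query="_cat/thread_pool/search?v&h=node_name,active,queue,rejected,completed"

    # Show per-node 1-minute load average, CPU, and heap usage to spot the overloaded member
    oc -n openshift-logging exec -c elasticsearch elasticsearch-cd-a76lica6-2 -- \
      es_util --query="_cat/nodes?v&h=name,load_1m,cpu,heap.percent"

A steadily growing `rejected` count on a single node, combined with a 1-minute load average far above its peers, would match the behaviour described in this issue.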
    

Environment

  • Red Hat OpenShift Container Platform 4
  • Red Hat OpenShift Logging 5.2
