Elasticsearch operator getting timed out while connecting to Elasticsearch cluster
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
Issue
- Elasticsearch Operator getting timed out while connecting to the Elasticsearch cluster:
$ oc -n openshift-operators-redhat logs <elasticsearch operator pod>
{"level":"error","ts":1617202631.8834436,"logger":"elasticsearch-operator","caller":"k8shandler/index_management.go:73","msg":"Unable to list existing templates in order to reconcile stale ones","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_template\": dial tcp 10.x.x.x:9200: i/o timeout"}
{"level":"error","ts":1617202661.884272,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"failed to create index template","mapping":"app","error":"failed decoding raw response body into `map[string]estypes.GetIndexTemplate` for elasticsearch in namespace openshift-logging: unexpected end of JSON input"}
{"level":"error","ts":1617202693.0506203,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:58","msg":"Unable to clear transient shard allocation","cluster":"elasticsearch","namespace":"openshift-logging","error":"Response: , Error: Put \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings\": dial tcp 10.x.x.x:9200: i/o timeout"}
- The rollover cronjobs are failing with the below error (see the commands after this list for how to inspect them):
$ oc -n openshift-logging logs <index management pod job>
...
Index management delete process starting
Error while attemping to determine the active write alias: {'error': 'alias [app-write] missing', 'status': 404}
Index management rollover process starting
Error while attemping to determine the active write alias: {'error': 'alias [app-write] missing', 'status': 404}
- Kibana displays logs with a delay of several hours and responds very slowly, making log analysis almost impossible
- Some of the indices in Elasticsearch are over 500GB in size
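The state of the rollover (index management) cronjobs mentioned above and their recent jobs can be checked as follows; this is a minimal sketch and the exact cronjob names vary per cluster:
$ oc -n openshift-logging get cronjobs
$ oc -n openshift-logging get jobs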
Resolution
An inappropriate Network Policy applied at the namespace level in openshift-logging or openshift-operators-redhat can cause the connection between the Elasticsearch Operator pod and the Elasticsearch service to fail. Review these network policies and modify or delete them as needed.
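For example, the policies in the affected namespace can be reviewed and, if one is found to block traffic to the Elasticsearch service on port 9200, removed. This is a sketch; <policy_name> is a placeholder for the offending policy:
$ oc -n openshift-logging get networkpolicies
$ oc -n openshift-logging describe networkpolicy <policy_name>
$ oc -n openshift-logging delete networkpolicy <policy_name>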
Root Cause
- The Elasticsearch Operator is in charge of the initial configuration of the Elasticsearch cluster. If the Elasticsearch Operator is not able to reach the Elasticsearch service, it cannot create the appropriate aliases for the app, infra and audit indices the first time.
- The lack of communication also causes the Elasticsearch index management cronjobs to fail. As a result, the indices are not deleted or rolled over and can grow to over 500GB in size. The communication is typically blocked by a custom Network Policy or a custom Service Mesh configuration.
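Whether the initial write aliases were created can be verified from inside an Elasticsearch pod, for example with the es_util helper shipped in the Elasticsearch image. This is a sketch; <elasticsearch_pod> is a placeholder:
$ oc -n openshift-logging exec -c elasticsearch <elasticsearch_pod> -- es_util --query=_cat/aliases?v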
Diagnostic Steps
Check that the CSVs of the Elasticsearch Operator and the Cluster Logging Operator are at the same version:
$ oc -n openshift-logging get csv
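To find the Elasticsearch Operator pod name used in the next step, list the pods in the operator namespace; the label selector below is an assumption and can be omitted to list all pods:
$ oc -n openshift-operators-redhat get pods -l name=elasticsearch-operator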
Try to reach the Elasticsearch service from the Elasticsearch Operator pod. If this does not work, check from the node level.
$ oc -n openshift-operators-redhat rsh <es_operator_pod>
$ curl -v -k elasticsearch.openshift-logging.svc:9200
Try to reach the Elasticsearch service from a Fluentd pod running on the same node as the Elasticsearch Operator pod.
$ oc -n openshift-logging rsh <fluentd_pod>
$ curl -v -k elasticsearch.openshift-logging.svc:9200
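To pick a Fluentd pod on the same node as the operator, the node placement of both pods can be compared; the component=fluentd label selector is an assumption and may differ between logging versions:
$ oc -n openshift-operators-redhat get pod <es_operator_pod> -o wide
$ oc -n openshift-logging get pods -l component=fluentd -o wide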
If the connection to the Elasticsearch service fails in both cases, check for any Network Policies applied at the namespace level.
$ oc -n openshift-logging get networkpolicies
$ oc -n openshift-operators-redhat get networkpolicies
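Because a custom Service Mesh configuration can also block this traffic, it may help to check whether the logging namespaces are enrolled in a mesh. This sketch assumes Red Hat OpenShift Service Mesh is installed; otherwise the resource does not exist:
$ oc get servicemeshmemberroll -A
$ oc -n openshift-logging get pods -o yaml | grep istio-proxy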
Example log output of the Elasticsearch Operator, pointing to the lack of communication:
$ oc logs elasticsearch-operator-6fd4cbd8dd-9fpqs -c elasticsearch-operator
2022-02-14T11:41:18.744329002Z {"_ts":"2022-02-14T11:41:18.744275531Z","_level":"0","_component":"elasticsearch-operator_controller_elasticsearch-controller","_message":"Reconciler error","_error":{"msg":"failed decoding raw response body into `map[string]estypes.GetIndexTemplate` for elasticsearch in namespace openshift-logging: unexpected end of JSON input"},"name":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:41:49.669333708Z {"_ts":"2022-02-14T11:41:49.669115485Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to clear transient shard allocation","_error":{"cause":{"Op":"Put","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"cluster":"elasticsearch","msg":"failed to clear shard allocation","namespace":"openshift-logging","response":""},"cluster":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:43:19.675480205Z {"_ts":"2022-02-14T11:43:19.675268168Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to check if threshold is enabled","cluster":"elasticsearch","error":{"Op":"Get","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"namespace":"openshift-logging"}
2022-02-14T11:43:49.745517779Z {"_ts":"2022-02-14T11:43:49.745430632Z","_level":"0","_component":"elasticsearch-operator","_message":"failed to get LowestClusterVersion","_error":{"msg":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all\": dial tcp xxx.xxx.xxx.xxx:9200: i/o timeout"},"cluster":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:45:19.751047448Z {"_ts":"2022-02-14T11:45:19.750805926Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to check if threshold is enabled","cluster":"elasticsearch","error":{"Op":"Get","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"namespace":"openshift-logging"}