Elasticsearch Operator timing out while connecting to the Elasticsearch cluster
Environment
- Red Hat OpenShift Container Platform (RHOCP) 4
Issue
- Image inconsistency in the Elasticsearch deployments: two deployments use the same image, while the third uses a different one
- The Elasticsearch Operator is not able to roll out the Elasticsearch pods that have different image versions
- The Elasticsearch Operator is timing out while connecting to the Elasticsearch cluster:
$ oc -n openshift-operators-redhat logs <elasticsearch operator pod>
{"level":"error","ts":1617202631.8834436,"logger":"elasticsearch-operator","caller":"k8shandler/index_management.go:73","msg":"Unable to list existing templates in order to reconcile stale ones","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_template\": dial tcp 10.x.x.x:9200: i/o timeout"}
{"level":"error","ts":1617202661.884272,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"failed to create index template","mapping":"app","error":"failed decoding raw response body into `map[string]estypes.GetIndexTemplate` for elasticsearch in namespace openshift-logging: unexpected end of JSON input"}
{"level":"error","ts":1617202693.0506203,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:58","msg":"Unable to clear transient shard allocation","cluster":"elasticsearch","namespace":"openshift-logging","error":"Response: , Error: Put \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings\": dial tcp 10.x.x.x:9200: i/o timeout"}
- The rollover cronjobs are failing with the below error:
$ oc -n openshift-logging logs <index management pod job>
...
Index management delete process starting
Error while attemping to determine the active write alias: {'error': 'alias [app-write] missing', 'status': 404}
Index management rollover process starting
Error while attemping to determine the active write alias: {'error': 'alias [app-write] missing', 'status': 404}
- Kibana is displaying logs with a delay of several hours and is responding very slowly, making log analysis almost impossible
- Some of the indices in Elasticsearch are over 500GB in size
Resolution
An inappropriate Network Policy applied at the namespace level in openshift-logging or openshift-operators-redhat can block connections between the Elasticsearch Operator pod and the Elasticsearch service. Review these network policies and modify or delete them; if a policy must remain in place, the required traffic can be allowed explicitly instead (see the sketch after Step 3).
Step 1. List network policies in a namespace:
To view NetworkPolicy objects defined in a namespace, enter the following command:
$ oc get networkpolicy -n <namespace>
Step 2. To examine a specific network policy, enter the following command:
$ oc describe networkpolicy <policy_name> -n <namespace>
Step 3. To delete a NetworkPolicy object, enter the following command:
$ oc delete networkpolicy <policy_name> -n <namespace>
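If a policy cannot simply be deleted, the required traffic can be allowed explicitly instead. The following is a minimal sketch, not part of the original resolution: it assumes the Elasticsearch pods carry the component: elasticsearch label and that the namespace carries the standard kubernetes.io/metadata.name label (available on recent OCP 4 versions); verify both in your cluster before applying anything similar.
$ oc -n openshift-logging apply -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-eo-to-elasticsearch
spec:
  # Assumed label on the Elasticsearch pods; verify with 'oc get pods --show-labels'
  podSelector:
    matchLabels:
      component: elasticsearch
  policyTypes:
  - Ingress
  ingress:
  - from:
    # Traffic from the Elasticsearch Operator namespace
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-operators-redhat
    # Traffic from other pods in openshift-logging (collectors, Kibana, index management cronjobs)
    - podSelector: {}
    ports:
    - protocol: TCP
      port: 9200
EOF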
Root Cause
- The Elasticsearch Operator is in charge of the initial configuration of the Elasticsearch cluster. If the operator is not able to reach the Elasticsearch service, it cannot create the appropriate aliases for the app, infra and audit indices on the first rollout.
- The lack of communication effectively pauses the Elasticsearch index-management cronjobs: the indices are neither rotated nor deleted and can grow to over 500GB in size. The blocked communication is caused by a custom Network Policy or a custom Service Mesh configuration.
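To confirm that indices have grown unchecked, their sizes can be listed from inside an Elasticsearch pod. This is a minimal sketch that assumes the es_util helper shipped in the OpenShift Elasticsearch image and the component=elasticsearch pod label; verify both before running it.
$ oc -n openshift-logging get pods -l component=elasticsearch -o name
$ oc -n openshift-logging exec -c elasticsearch <elasticsearch pod> -- es_util --query="_cat/indices?v"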
Diagnostic Steps
Check that the CSVs of the Elasticsearch Operator and the Cluster Logging Operator are at the same version:
$ oc -n openshift-logging get csv
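To narrow the output down to the two relevant operators, the list can be filtered; the exact CSV names vary by release, so the pattern below is only an illustrative assumption:
$ oc -n openshift-logging get csv | grep -iE 'elasticsearch-operator|cluster-?logging'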
Try to reach the Elasticsearch service from the Elasticsearch Operator pod. If this does not work, check from the node level as well. Any response from curl, even an error, proves network connectivity; a connection that hangs and times out points to blocked traffic.
$ eo=$(oc get pod -l name=elasticsearch-operator -o name -n openshift-operators-redhat)
$ oc -n openshift-operators-redhat rsh $eo
$ curl -v -k elasticsearch.openshift-logging.svc:9200
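Before testing connectivity, it can also help to confirm that the Elasticsearch service exists and has endpoints; a short aside, not part of the original steps:
$ oc -n openshift-logging get svc elasticsearch
$ oc -n openshift-logging get endpoints elasticsearch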
Try to reach the Elasticsearch service from the node where the Elasticsearch Operator is running, for example via the collector pod scheduled on that node:
$ oc -n openshift-logging rsh <collector pod>
$ curl -v -k elasticsearch.openshift-logging.svc:9200
If the connection to the Elasticsearch service fails in both cases, check for any Network Policies applied at the namespace level:
$ oc -n openshift-logging get networkpolicies
$ oc -n openshift-operators-redhat get networkpolicies
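A quick way to judge whether a policy actually selects the Elasticsearch pods is to compare its podSelector and ingress rules against the pod labels; the grep filter below simply matches the pod names and is an illustrative assumption:
# Show each policy's podSelector and ingress rules
$ oc -n openshift-logging describe networkpolicy
# Compare against the labels carried by the Elasticsearch pods
$ oc -n openshift-logging get pods --show-labels | grep elasticsearch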
Example log output of the Elasticsearch Operator, pointing to the lack of communication:
$ eo=$(oc get pod -l name=elasticsearch-operator -o name -n openshift-operators-redhat)
$ oc logs $eo -n openshift-operators-redhat
2022-02-14T11:41:18.744329002Z {"_ts":"2022-02-14T11:41:18.744275531Z","_level":"0","_component":"elasticsearch-operator_controller_elasticsearch-controller","_message":"Reconciler error","_error":{"msg":"failed decoding raw response body into `map[string]estypes.GetIndexTemplate` for elasticsearch in namespace openshift-logging: unexpected end of JSON input"},"name":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:41:49.669333708Z {"_ts":"2022-02-14T11:41:49.669115485Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to clear transient shard allocation","_error":{"cause":{"Op":"Put","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"cluster":"elasticsearch","msg":"failed to clear shard allocation","namespace":"openshift-logging","response":""},"cluster":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:43:19.675480205Z {"_ts":"2022-02-14T11:43:19.675268168Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to check if threshold is enabled","cluster":"elasticsearch","error":{"Op":"Get","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"namespace":"openshift-logging"}
2022-02-14T11:43:49.745517779Z {"_ts":"2022-02-14T11:43:49.745430632Z","_level":"0","_component":"elasticsearch-operator","_message":"failed to get LowestClusterVersion","_error":{"msg":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all\": dial tcp xxx.xxx.xxx.xxx:9200: i/o timeout"},"cluster":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:45:19.751047448Z {"_ts":"2022-02-14T11:45:19.750805926Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to check if threshold is enabled","cluster":"elasticsearch","error":{"Op":"Get","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"namespace":"openshift-logging"}
This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.