Elasticsearch operator getting timed out while connecting to Elasticsearch cluster

Environment

  • Red Hat OpenShift Container Platform (RHOCP)
    • 4

Issue

  • Image inconsistency across the Elasticsearch deployments: two deployments use the same image while the third one uses a different image (see the verification commands after this list)
  • The Elasticsearch Operator is not able to roll out the Elasticsearch pods that have different image versions.
  • Elasticsearch Operator getting timed out while connecting to Elasticsearch cluster:

    $ oc -n openshift-operators-redhat logs <elasticsearch operator pod>
    "level":"error","ts":1617202631.8834436,"logger":"elasticsearch-operator","caller":"k8shandler/index_management.go:73","msg":"Unable to list existing templates in order to reconcile stale ones","error":"Get \https://elasticsearch.openshift-logging.svc:9200/_template\: dial tcp 10.x.x.x:9200: i/o timeout"}
    {"level":"error","ts":1617202661.884272,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"failed to create index template","mapping":"app","error":"failed decoding raw response body into `map[string]estypes.GetIndexTemplate` for elasticsearch in namespace openshift-logging: unexpected end of JSON input"}
    {"level":"error","ts":1617202693.0506203,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:58","msg":"Unable to clear transient shard allocation","cluster":"elasticsearch","namespace":"openshift-logging","error":"Response: , Error: Put \https://elasticsearch.openshift-logging.svc:9200/_cluster/settings\: dial tcp 10.x.x.x:9200: i/o timeout"}
    
  • The rollover cronjobs are failing with the following error:

    $ oc -n openshift-logging logs <index management pod job>
    ...
    Index management delete process starting
    Error while attemping to determine the active write alias: {'error': 'alias [app-write] missing', 'status': 404}
    Index management rollover process starting
    Error while attemping to determine the active write alias: {'error': 'alias [app-write] missing', 'status': 404}
    
  • Kibana displays logs with an hours-long delay and responds very slowly, making log analysis almost impossible

  • Some of the indices in Elasticsearch are over 500GB in size
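
To confirm these symptoms, the image used by each Elasticsearch pod and the on-disk size of the indices can be listed. This is a sketch that assumes the standard component=elasticsearch label and the es_util helper shipped in the Elasticsearch image; verify the selectors in your cluster before relying on it (the $es_pod variable is only for illustration).

    $ oc -n openshift-logging get pods -l component=elasticsearch \
        -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.spec.containers[?(@.name=="elasticsearch")].image}{"\n"}{end}'
    $ es_pod=$(oc -n openshift-logging get pod -l component=elasticsearch -o name | head -1)
    $ oc -n openshift-logging exec -c elasticsearch $es_pod -- es_util --query="_cat/indices?v&h=index,store.size"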

Resolution

An inappropriate NetworkPolicy applied at the namespace level in openshift-logging or openshift-operators-redhat can block the connection between the Elasticsearch Operator pod and the Elasticsearch service. Review these network policies and modify or delete them, or replace them with a policy that explicitly allows the traffic (see the example after Step 3).

Step 1. To view the NetworkPolicy objects defined in a namespace, enter the following command:

$ oc get networkpolicy -n <namespace>

Step 2. To examine a specific network policy, enter the following command:

$ oc describe networkpolicy <policy_name> -n <namespace> 

Step 3. To delete a NetworkPolicy object, enter the following command:

$ oc delete networkpolicy <policy_name> -n <namespace>
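
If the offending policy cannot simply be deleted, an additional NetworkPolicy can be created that explicitly allows the Elasticsearch Operator to reach the Elasticsearch pods on port 9200 (NetworkPolicies are additive, so an extra allow rule is sufficient). The following is a minimal sketch, not a drop-in fix: the policy name is hypothetical, the component=elasticsearch pod label and the kubernetes.io/metadata.name namespace label are assumptions to verify in your cluster, and other clients (collector, Kibana, the index management jobs) may need similar rules.

$ oc apply -n openshift-logging -f - <<EOF
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: allow-elasticsearch-operator    # hypothetical name
spec:
  # Assumed label on the Elasticsearch pods; verify with "oc get pods --show-labels".
  podSelector:
    matchLabels:
      component: elasticsearch
  ingress:
  - from:
    # Allow traffic coming from the namespace that runs the Elasticsearch Operator.
    - namespaceSelector:
        matchLabels:
          kubernetes.io/metadata.name: openshift-operators-redhat
    ports:
    - protocol: TCP
      port: 9200
  policyTypes:
  - Ingress
EOF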

Root Cause

  • The Elasticsearch Operator is in charge of the initial configuration of the Elasticsearch cluster. If the Elasticsearch Operator cannot reach the Elasticsearch service, it is unable to create the appropriate write aliases for the app, infra and audit indices on that first run.

  • The lack of communication leaves the Elasticsearch index management cronjobs stalled: the indices are neither rolled over nor deleted and can grow to over 500GB in size (see the checks after this list). The communication is typically blocked by a custom Network Policy or a custom Service Mesh configuration.
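
To confirm this root cause, the state of the index management cronjobs and the write aliases can be checked directly. A sketch, assuming the component=elasticsearch label and the es_util helper in the Elasticsearch image; if write aliases such as app-write are absent, the operator never completed its initial configuration.

$ oc -n openshift-logging get cronjobs
$ es_pod=$(oc -n openshift-logging get pod -l component=elasticsearch -o name | head -1)
$ oc -n openshift-logging exec -c elasticsearch $es_pod -- es_util --query="_cat/aliases?v"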

Diagnostic Steps

Check that the CSV versions of the Elasticsearch Operator and the Cluster Logging Operator match:

$ oc -n openshift-logging get csv
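
To compare only the versions side by side, the custom-columns output format can be used (a convenience, not a requirement):

$ oc -n openshift-logging get csv -o custom-columns=NAME:.metadata.name,VERSION:.spec.version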

Try to reach the Elasticsearch service from the Elasticsearch Operator pod. If that does not work, check from the node level as well (see the following steps).

$ eo=$(oc get pod -l name=elasticsearch-operator -o name -n openshift-operators-redhat)
$ oc -n openshift-operators-redhat rsh $eo
$ curl -v -k elasticsearch.openshift-logging.svc:9200
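
The same check can be run non-interactively with oc exec (this assumes, as the rsh example above already does, that curl is available in the operator image):

$ oc -n openshift-operators-redhat exec $eo -- curl -v -k elasticsearch.openshift-logging.svc:9200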

Try to reach the Elasticsearch service from a collector pod scheduled on the node where the Elasticsearch Operator is running (the collector runs as a daemonset, so there is one pod per node):

$ oc -n openshift-logging rsh <collector pod> 
$ curl -v -k elasticsearch.openshift-logging.svc:9200
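
To test from the node level itself, a debug pod can be started on the node that runs the Elasticsearch Operator and the service ClusterIP can be queried from the host. This is a sketch, assuming the service is named elasticsearch and that curl is present on the RHCOS host; the ClusterIP is used because cluster DNS names do not resolve from the host.

$ es_ip=$(oc -n openshift-logging get svc elasticsearch -o jsonpath='{.spec.clusterIP}')
$ node=$(oc -n openshift-operators-redhat get pod -l name=elasticsearch-operator -o jsonpath='{.items[0].spec.nodeName}')
$ oc debug node/$node -- chroot /host curl -v -k https://$es_ip:9200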

If the connection to the Elasticsearch service fails in both cases, check for any Network Policies applied at the namespace level:

$ oc -n openshift-logging get networkpolicies
$ oc -n openshift-operators-redhat get networkpolicies
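
Because a custom Service Mesh configuration can also block this traffic (see Root Cause), it may be worth checking whether either namespace has been enrolled in a ServiceMeshMemberRoll. This assumes OpenShift Service Mesh is installed; if it is not, the first command simply reports an unknown resource type. Member namespaces are typically labeled with maistra.io/member-of by the Service Mesh operator.

$ oc get servicemeshmemberroll --all-namespaces
$ oc get namespace openshift-logging openshift-operators-redhat --show-labels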

Example log output of the Elasticsearch Operator, indicating the lack of communication:

$ eo=$(oc get pod -l name=elasticsearch-operator -o name -n openshift-operators-redhat)
$ oc logs $eo -n openshift-operators-redhat 
2022-02-14T11:41:18.744329002Z {"_ts":"2022-02-14T11:41:18.744275531Z","_level":"0","_component":"elasticsearch-operator_controller_elasticsearch-controller","_message":"Reconciler error","_error":{"msg":"failed decoding raw response body into `map[string]estypes.GetIndexTemplate` for elasticsearch in namespace openshift-logging: unexpected end of JSON input"},"name":"elasticsearch","namespace":"openshift-logging"}
  2022-02-14T11:41:49.669333708Z {"_ts":"2022-02-14T11:41:49.669115485Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to clear transient shard allocation","_error":{"cause":{"Op":"Put","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"cluster":"elasticsearch","msg":"failed to clear shard allocation","namespace":"openshift-logging","response":""},"cluster":"elasticsearch","namespace":"openshift-logging"}
  2022-02-14T11:43:19.675480205Z {"_ts":"2022-02-14T11:43:19.675268168Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to check if threshold is enabled","cluster":"elasticsearch","error":{"Op":"Get","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"namespace":"openshift-logging"}
  2022-02-14T11:43:49.745517779Z {"_ts":"2022-02-14T11:43:49.745430632Z","_level":"0","_component":"elasticsearch-operator","_message":"failed to get LowestClusterVersion","_error":{"msg":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all\": dial tcp xxx.xxx.xxx.xxx:9200: i/o timeout"},"cluster":"elasticsearch","namespace":"openshift-logging"}
  2022-02-14T11:45:19.751047448Z {"_ts":"2022-02-14T11:45:19.750805926Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to check if threshold is enabled","cluster":"elasticsearch","error":{"Op":"Get","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"namespace":"openshift-logging"}

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
