Elasticsearch operator logs show "timed out" when connecting to the Elasticsearch cluster
Environment
- Red Hat OpenShift Container Platform (RHOCP)
- 4
Issue
- When the Elasticsearch Operator connects to the Elasticsearch cluster, its logs show "timed out":
$ oc -n openshift-operators-redhat logs <elasticsearch operator pod>
{"level":"error","ts":1617202631.8834436,"logger":"elasticsearch-operator","caller":"k8shandler/index_management.go:73","msg":"Unable to list existing templates in order to reconcile stale ones","error":"Get \"https://elasticsearch.openshift-logging.svc:9200/_template\": dial tcp 10.x.x.x:9200: i/o timeout"}
{"level":"error","ts":1617202661.884272,"logger":"elasticsearch-operator","caller":"k8shandler/reconciler.go:80","msg":"failed to create index template","mapping":"app","error":"failed decoding raw response body into `map[string]estypes.GetIndexTemplate` for elasticsearch in namespace openshift-logging: unexpected end of JSON input"}
{"level":"error","ts":1617202693.0506203,"logger":"elasticsearch-operator","caller":"k8shandler/cluster.go:58","msg":"Unable to clear transient shard allocation","cluster":"elasticsearch","namespace":"openshift-logging","error":"Response: , Error: Put \"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings\": dial tcp 10.x.x.x:9200: i/o timeout"}
- The index management cronjobs fail with the following errors:
$ oc -n openshift-logging logs <index management pod job>
...
Index management delete process starting
Error while attemping to determine the active write alias: {'error': 'alias [app-write] missing', 'status': 404}
Index management rollover process starting
Error while attemping to determine the active write alias: {'error': 'alias [app-write] missing', 'status': 404}
- Kibana shows logs that are delayed by several hours and responds so slowly that log analysis is almost impossible.
- Some indices in Elasticsearch are larger than 500GB.
Resolution
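Unsuitable Network Policies applied at the namespace level in the openshift-logging or openshift-operators-redhat namespace can prevent the Elasticsearch Operator from connecting to the Elasticsearch service. Review these Network Policies and modify or delete them.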
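Before modifying or deleting a policy, it can help to keep a copy of the current definitions for reference. The following is a minimal sketch, assuming a logged-in `oc` session; the function name `backup_network_policies` and the output directory are illustrative and not part of any Red Hat tooling.

```shell
#!/usr/bin/env bash
# Sketch: save every NetworkPolicy from the two logging-related namespaces
# before changing or deleting anything.
# backup_network_policies and the default directory are hypothetical names.

backup_network_policies() {
  local outdir="${1:-./networkpolicy-backup}"
  local ns
  mkdir -p "$outdir" || return 1
  for ns in openshift-logging openshift-operators-redhat; do
    # -o yaml writes the full definition of each policy in the namespace.
    oc -n "$ns" get networkpolicies -o yaml > "$outdir/$ns.yaml" || return 1
    echo "saved $outdir/$ns.yaml"
  done
}

# Usage:
#   backup_network_policies ./networkpolicy-backup
#   oc -n openshift-logging delete networkpolicy <policy name>
```

The saved files can then be read offline to decide which policy blocks traffic to port 9200 before any change is made on the cluster.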
Root Cause
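- The Elasticsearch Operator is responsible for the initial configuration of everything in the Elasticsearch cluster. If the Elasticsearch Operator cannot reach the Elasticsearch service, it cannot create the required aliases for the application, infrastructure, and audit indices in the first place.
- This lack of communication causes the Elasticsearch cronjobs to stall. In that state, indices are neither deleted nor rolled over, so they grow beyond 500GB. The communication is blocked by custom network policies or a custom service mesh configuration.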
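To see whether the write aliases exist and which indices have grown large, the aliases and indices can be listed from inside an Elasticsearch pod. This is a sketch under assumptions: it relies on the `es_util` helper shipped in the OpenShift Elasticsearch image, on the pods carrying the `component=elasticsearch` label, and on a hypothetical function name `list_aliases_and_indices`.

```shell
#!/usr/bin/env bash
# Sketch: list aliases and indices (largest first) from one Elasticsearch pod.
# Assumptions: pods are labelled component=elasticsearch and the container
# named "elasticsearch" provides the es_util helper.

list_aliases_and_indices() {
  local ns=openshift-logging
  local pod
  # Pick the first Elasticsearch pod, returned in the form pod/<name>.
  pod="$(oc -n "$ns" get pods -l component=elasticsearch -o name | head -n 1)"
  [ -n "$pod" ] || { echo "no Elasticsearch pod found in $ns" >&2; return 1; }
  # A missing app-write alias matches the cronjob error shown in this article.
  oc -n "$ns" exec -c elasticsearch "$pod" -- es_util --query='_cat/aliases?v'
  # Sort by store size so oversized indices appear at the top.
  oc -n "$ns" exec -c elasticsearch "$pod" -- es_util --query='_cat/indices?v&s=store.size:desc'
}

# Usage:
#   list_aliases_and_indices
```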
Diagnostic Steps
Check whether the Elasticsearch Operator and the ClusterLogging Operator are the same version:
$ oc -n openshift-logging get csv
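The version comparison can also be scripted. The sketch below reads "NAME VERSION" pairs and reports whether both operators share the same major.minor version. Treating major.minor as "the same version" is an assumption made here, as are the function name `same_minor_version` and the CSV name prefixes it matches.

```shell
#!/usr/bin/env bash
# Sketch: compare operator versions taken from the CSV list.
# Reads lines of the form "<csv name> <version>" on stdin.
# Exit status: 0 = same major.minor, 1 = different, 2 = a CSV is missing.

same_minor_version() {
  awk '
    $1 ~ /^elasticsearch-operator/ { es = $2 }
    $1 ~ /^cluster-?logging/       { cl = $2 }
    END {
      if (es == "" || cl == "") { print "missing CSV"; exit 2 }
      split(es, a, "."); split(cl, b, ".")
      printf "elasticsearch-operator=%s cluster-logging=%s\n", es, cl
      exit !(a[1] == b[1] && a[2] == b[2])
    }'
}

# Usage:
#   oc -n openshift-logging get csv --no-headers \
#     -o custom-columns=NAME:.metadata.name,VERSION:.spec.version | same_minor_version
```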
Try to access the Elasticsearch service from the Elasticsearch operator pod. If that does not work, check from the node level.
$ oc -n openshift-operators-redhat rsh <es_operator_pod>
$ curl -v -k elasticsearch.openshift-logging.svc:9200
Try to access the ES service from the node where the ES operator is running.
$ oc rsh <fluentd_pod>
$ curl -v -k elasticsearch.openshift-logging.svc:9200
If the connection to the ES service fails in both cases, check for any Network Policies applied at the namespace level.
$ oc -n openshift-logging get networkpolicies
$ oc -n openshift-operators-redhat get networkpolicies
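The connectivity checks above can also be run non-interactively with a timeout, so a blocked connection fails quickly instead of hanging. This is a sketch, assuming `curl` is available in the target pod and that `oc exec` passes curl's exit code through; `probe_es_from_pod` is a hypothetical helper name. Any HTTP status, including 401 or 403, means the network path works; curl exit codes 7 and 28 are the ones that point to blocked traffic.

```shell
#!/usr/bin/env bash
# Sketch: probe the Elasticsearch service from inside a pod.
# Arguments: <namespace> <pod name>

probe_es_from_pod() {
  local ns="$1" pod="$2"
  local url="https://elasticsearch.openshift-logging.svc:9200"
  local code rc=0
  # -m 10 limits the whole request to 10 seconds; -k skips TLS verification.
  code="$(oc -n "$ns" exec "$pod" -- \
    curl -s -k -m 10 -o /dev/null -w '%{http_code}' "$url")" || rc=$?
  case "$rc" in
    0)  echo "reachable (HTTP $code)" ;;
    6)  echo "could not resolve host (curl exit 6)" ;;
    7)  echo "connection failed (curl exit 7)" ;;
    28) echo "timed out (curl exit 28)" ;;
    *)  echo "other error (exit $rc)" ;;
  esac
  return "$rc"
}

# Usage:
#   probe_es_from_pod openshift-operators-redhat <es_operator_pod>
#   probe_es_from_pod openshift-logging <fluentd_pod>
```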
Sample log output from the Elasticsearch Operator that points to a lack of communication:
$ oc logs elasticsearch-operator-6fd4cbd8dd-9fpqs -c elasticsearch-operator
2022-02-14T11:41:18.744329002Z {"_ts":"2022-02-14T11:41:18.744275531Z","_level":"0","_component":"elasticsearch-operator_controller_elasticsearch-controller","_message":"Reconciler error","_error":{"msg":"failed decoding raw response body into `map[string]estypes.GetIndexTemplate` for elasticsearch in namespace openshift-logging: unexpected end of JSON input"},"name":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:41:49.669333708Z {"_ts":"2022-02-14T11:41:49.669115485Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to clear transient shard allocation","_error":{"cause":{"Op":"Put","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"cluster":"elasticsearch","msg":"failed to clear shard allocation","namespace":"openshift-logging","response":""},"cluster":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:43:19.675480205Z {"_ts":"2022-02-14T11:43:19.675268168Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to check if threshold is enabled","cluster":"elasticsearch","error":{"Op":"Get","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"namespace":"openshift-logging"}
2022-02-14T11:43:49.745517779Z {"_ts":"2022-02-14T11:43:49.745430632Z","_level":"0","_component":"elasticsearch-operator","_message":"failed to get LowestClusterVersion","_error":{"msg":"Get \"https://elasticsearch.openshift-logging.svc:9200/_cluster/stats/nodes/_all\": dial tcp xxx.xxx.xxx.xxx:9200: i/o timeout"},"cluster":"elasticsearch","namespace":"openshift-logging"}
2022-02-14T11:45:19.751047448Z {"_ts":"2022-02-14T11:45:19.750805926Z","_level":"0","_component":"elasticsearch-operator","_message":"Unable to check if threshold is enabled","cluster":"elasticsearch","error":{"Op":"Get","URL":"https://elasticsearch.openshift-logging.svc:9200/_cluster/settings?include_defaults=true","Err":{"Op":"dial","Net":"tcp","Source":null,"Addr":{"IP":"xxx.xxx.xxx.xxx","Port":9200,"Zone":""},"Err":{}}},"namespace":"openshift-logging"}
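In a long operator log, lines like the ones above can be isolated with a simple filter. The sketch below matches the three patterns that appear in this article's sample output; the pattern list is an assumption and may not cover every connectivity error the operator can emit. `filter_connectivity_errors` is a hypothetical helper name.

```shell
#!/usr/bin/env bash
# Sketch: keep only operator log lines that suggest the Elasticsearch
# service could not be reached. Reads log lines on stdin.

filter_connectivity_errors() {
  # Patterns taken from the sample log output in this article.
  grep -E 'i/o timeout|"Op":"dial"|unexpected end of JSON input'
}

# Usage:
#   oc -n openshift-operators-redhat logs <elasticsearch operator pod> \
#     -c elasticsearch-operator | filter_connectivity_errors
```

As with `grep`, the function returns a non-zero status when no line matches, which can be used to confirm that the errors have stopped after the Network Policies are corrected.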