Elasticsearch went into Red state with shards 'UNASSIGNED DANGLING_INDEX_IMPORTED'

Environment

  • Red Hat OpenShift Container Platform
    • 3.11

Issue

  • Elasticsearch went into Red state with many unassigned shards that report the reason DANGLING_INDEX_IMPORTED:
$ cat health 
epoch      timestamp cluster    status node.total node.data shards pri relo init unassign pending_tasks max_task_wait_time active_shards_percent
1615977258 10:34:18  logging-es red             3         3     97  97    0    0       43             0                  -                 69.3%

$ cat unassigned_shards
project.rrs.3baf00c6-6a65-11e9-83ee-001a4a200175.2020.05.14     0 p UNASSIGNED DANGLING_INDEX_IMPORTED
project.rrs.3baf00c6-6a65-11e9-83ee-001a4a200175.2020.05.17     0 p UNASSIGNED DANGLING_INDEX_IMPORTED
.operations.2020.05.13                                          0 p UNASSIGNED DANGLING_INDEX_IMPORTED
project.cte.06b52b67-173e-11ea-97b9-001a4a200175.2020.05.21     0 p UNASSIGNED DANGLING_INDEX_IMPORTED
project.rrs.3baf00c6-6a65-11e9-83ee-001a4a200175.2020.05.19     0 p UNASSIGNED DANGLING_INDEX_IMPORTED
project.imh.68b55ff5-6a61-11e9-83ee-001a4a200175.2020.05.22     0 p UNASSIGNED DANGLING_INDEX_IMPORTED
project.cte.06b52b67-173e-11ea-97b9-001a4a200175.2020.05.18     0 p UNASSIGNED DANGLING_INDEX_IMPORTED
.operations.2020.05.22                                          0 p UNASSIGNED DANGLING_INDEX_IMPORTED
project.rrs.3baf00c6-6a65-11e9-83ee-001a4a200175.2020.05.20     0 p UNASSIGNED DANGLING_INDEX_IMPORTED
.operations.2020.05.17                                          0 p UNASSIGNED DANGLING_INDEX_IMPORTED
project.mds.521d332d-6677-11ea-af12-001a4a20017f.2020.05.22     0 p UNASSIGNED DANGLING_INDEX_IMPORTED
project.imh.68b55ff5-6a61-11e9-83ee-001a4a200175.2020.05.19     0 p UNASSIGNED DANGLING_INDEX_IMPORTED


$ oc exec $es_pod -- es_util --query="_cluster/allocation/explain?pretty"
REASON:
  "can_allocate" : "no_valid_shard_copy",
  "allocate_explanation" : "cannot allocate because a previous copy of the primary shard existed but can no longer be found on the nodes in the cluster",

Resolution

  • The allocation explanation shown in the Issue section reports no_valid_shard_copy: the primary shard cannot be allocated because a previous copy of it existed but can no longer be found on any node in the cluster.
  • This can happen when the cluster has lost or corrupted the data in the primary shards. If there are no replicas, the cluster has no other copy to recover the data from. The solution here is to assign empty shards to the index: delete the shards that report the error, then restart the ES cluster.
  • Restart the ES cluster with the following commands:
$ for i in $(oc get dc -l component=es -o name); do oc scale "$i" --replicas=0; done
$ oc get po    # confirm the ES pods have terminated before scaling back up
$ for i in $(oc get dc -l component=es -o name); do oc scale "$i" --replicas=1; done
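
The deletion step that precedes the restart can be sketched as a dry run. This is only an illustration under stated assumptions: it assumes that "deleting the shards" is done by deleting the indices that own them (Elasticsearch has no API for deleting a single shard), and that es_util in the Elasticsearch pod forwards extra arguments such as -XDELETE to curl. The function name print_delete_commands and the pod name used below are hypothetical. Deleting an index permanently discards its data, so the helper only prints the commands for review.

```shell
# Dry-run helper (hypothetical name): read index names from stdin, one per
# line, and print the delete command for each. Nothing is executed here.
# Argument 1 is the name of one Elasticsearch pod.
print_delete_commands() {
  pod="${1:?usage: print_delete_commands <es-pod-name>}"
  while IFS= read -r index; do
    [ -n "$index" ] || continue   # skip blank lines
    # Assumption: es_util passes -XDELETE through to curl.
    printf 'oc exec %s -- es_util --query=%s -XDELETE\n' "$pod" "$index"
  done
}
```

For example, `printf '%s\n' .operations.2020.05.13 | print_delete_commands logging-es-1` prints one `oc exec ... -XDELETE` line, which can be run by hand once the index name has been checked.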

Root Cause

  • The primary shard data of the ES cluster has been deleted or corrupted.

Diagnostic Steps

  • Collect a logging dump and review it:
$ wget https://raw.githubusercontent.com/openshift/origin-aggregated-logging/release-3.11/hack/logging-dump.sh
$ chmod +x logging-dump.sh
$ oc login -u admin -p <password> https://openshift.example.com:8443
$ ./logging-dump.sh
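
A shard listing like the unassigned_shards file in the Issue section can be reduced to the distinct index names that are affected. The following is a minimal sketch, assuming whitespace-separated rows in the column order index, shard, prirep, state, reason; the function name dangling_indices is hypothetical.

```shell
# Print the unique index names whose shards are UNASSIGNED with the reason
# DANGLING_INDEX_IMPORTED. Reads rows from stdin in the assumed column order:
# index shard prirep state reason
dangling_indices() {
  awk '$4 == "UNASSIGNED" && $5 == "DANGLING_INDEX_IMPORTED" { print $1 }' | sort -u
}

# Possible use against a live cluster (assumes $es_pod is set and es_util is
# available in the pod, as elsewhere in this article):
# oc exec "$es_pod" -- es_util --query="_cat/shards?h=index,shard,prirep,state,unassigned.reason" | dangling_indices
```

Rows for healthy shards have no reason column, so they are skipped by the filter.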

This solution is part of Red Hat’s fast-track publication program, providing a huge library of solutions that Red Hat engineers have created while supporting our customers. To give you the knowledge you need the instant it becomes available, these articles may be presented in a raw and unedited form.
