11.4. Sharding Indexes

In some cases it can be useful to split (shard) the indexed data of a given entity into several Lucene indexes.

Warning

This solution is not recommended unless there is a pressing need. Searches will be slower as all shards have to be opened for a single search. Don't do it until you have a real use case!
Possible use cases for sharding are:
  • A single index is so huge that index update times are slowing the application down.
  • A typical search will only hit a sub-set of the index, such as when data is naturally segmented by customer, region or application.
Sharding is disabled by default; to enable it, set the number of shards with the hibernate.search.<indexName>.sharding_strategy.nbr_of_shards property, as shown in Example 11.4, “Enabling Index Sharding”. In this example 5 shards are enabled.

Example 11.4. Enabling Index Sharding

hibernate.search.<indexName>.sharding_strategy.nbr_of_shards = 5
The IndexShardingStrategy is responsible for splitting the data into sub-indexes. The default sharding strategy splits the data according to the hash value of the id string representation (generated by the FieldBridge), which ensures fairly balanced shards. You can replace the default strategy by implementing a custom IndexShardingStrategy. To use your custom strategy, set the hibernate.search.<indexName>.sharding_strategy property.
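As an illustrative sketch of what the default id-hash selection amounts to (this is not the library's actual implementation; the class and method names here are hypothetical), the shard for a document is chosen by hashing the id string and taking the remainder modulo the shard count:

```java
// Illustrative only: a minimal sketch of hash-based shard selection.
// ShardSelection and selectShard are hypothetical names, not library API.
public class ShardSelection {

    // Map the id's string representation to one of nbrOfShards sub-indexes.
    // floorMod keeps the result non-negative even when hashCode() is negative.
    static int selectShard(String idInString, int nbrOfShards) {
        return Math.floorMod(idInString.hashCode(), nbrOfShards);
    }

    public static void main(String[] args) {
        int nbrOfShards = 5;
        for (String id : new String[] { "1", "42", "elephant" }) {
            System.out.println(id + " -> shard " + selectShard(id, nbrOfShards));
        }
    }
}
```

Because the same id always hashes to the same value, additions and later updates of an entity are routed to the same shard, which is what keeps the shards consistent.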

Example 11.5. Specifying a Custom Sharding Strategy

hibernate.search.<indexName>.sharding_strategy = my.shardingstrategy.Implementation
The IndexShardingStrategy also allows for optimizing searches by selecting which shard to run the query against. By activating a filter (see Section 7.3.1, “Using Filters in a Sharded Environment”), a sharding strategy can select a subset of the shards used to answer a query (IndexShardingStrategy.getIndexManagersForQuery) and thus speed up the query execution.
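As a hedged illustration of that optimization (QueryRouting, shardsForQuery and the "customer" filter parameter are hypothetical names for this sketch, not library API), a strategy that shards by customer could narrow a query to a single shard whenever a customer filter is active, and fall back to all shards otherwise:

```java
import java.util.Map;

// Hypothetical sketch of query-time shard selection for a strategy that
// shards by a "customer" value; not the library's API.
public class QueryRouting {

    static final int NBR_OF_SHARDS = 5;

    // With an active "customer" filter only one shard can match; without it
    // every shard has to be opened for the query.
    static int[] shardsForQuery(Map<String, ?> activeFilterParameters) {
        Object customer = activeFilterParameters.get("customer");
        if (customer == null) {
            int[] all = new int[NBR_OF_SHARDS];
            for (int i = 0; i < NBR_OF_SHARDS; i++) {
                all[i] = i;
            }
            return all; // no routing information: search all shards
        }
        // same hash-based selection used when the document was added
        return new int[] { Math.floorMod(customer.toString().hashCode(), NBR_OF_SHARDS) };
    }
}
```

The key point is that the routing decision at query time must mirror the one made at indexing time, otherwise the narrowed query would miss documents.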
Each shard has an independent IndexManager and can therefore be configured with its own directory provider and back-end configuration. The IndexManager index names for the Animal entity in Example 11.6, “Sharding Configuration for Entity Animal” are Animal.0 to Animal.4. In other words, each shard has the name of its owning index followed by . (dot) and its index number (see also Section 5.3, “Directory Configuration”).

Example 11.6. Sharding Configuration for Entity Animal

hibernate.search.default.indexBase = /usr/lucene/indexes
hibernate.search.Animal.sharding_strategy.nbr_of_shards = 5
hibernate.search.Animal.directory_provider = filesystem
hibernate.search.Animal.0.indexName = Animal00 
hibernate.search.Animal.3.indexBase = /usr/lucene/sharded
hibernate.search.Animal.3.indexName = Animal03
In Example 11.6, “Sharding Configuration for Entity Animal”, the configuration uses the default id string hashing strategy and shards the Animal index into 5 sub-indexes. All sub-indexes are filesystem instances, and the directory where each sub-index is stored is as follows:
  • for sub-index 0: /usr/lucene/indexes/Animal00 (shared indexBase but overridden indexName)
  • for sub-index 1: /usr/lucene/indexes/Animal.1 (shared indexBase, default indexName)
  • for sub-index 2: /usr/lucene/indexes/Animal.2 (shared indexBase, default indexName)
  • for sub-index 3: /usr/lucene/sharded/Animal03 (overridden indexBase, overridden indexName)
  • for sub-index 4: /usr/lucene/indexes/Animal.4 (shared indexBase, default indexName)
When implementing an IndexShardingStrategy, any field can be used to determine the sharding selection. Keep in mind that to handle deletions and the purge and purgeAll operations, the implementation might need to return one or more indexes without being able to read all the field values or the primary identifier. If that information is not enough to pick a single index, all indexes should be returned, so that the delete operation is propagated to every index that could contain the documents to be deleted.
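The deletion rule above can be sketched as follows (DeletionRouting, shardsForDeletion and the customer-based routing are hypothetical illustrations, not library API): when the value the strategy shards on is not available at deletion time, the only safe choice is to return every shard.

```java
// Hypothetical sketch of deletion routing for a strategy that shards by a
// "customer" value; not the library's API.
public class DeletionRouting {

    static final int NBR_OF_SHARDS = 5;

    // purge/purgeAll and deletions may arrive with only the entity id, i.e.
    // without the field the strategy shards on. When the routing value is
    // unknown (null here), the delete must be propagated to all shards,
    // because any of them could contain the document.
    static int[] shardsForDeletion(String customer) {
        if (customer == null) {
            int[] all = new int[NBR_OF_SHARDS];
            for (int i = 0; i < NBR_OF_SHARDS; i++) {
                all[i] = i;
            }
            return all;
        }
        return new int[] { Math.floorMod(customer.hashCode(), NBR_OF_SHARDS) };
    }
}
```

Returning all shards is correct but costs a delete operation per shard; a strategy that shards purely on the id never has this problem, since the id is always available.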