5.5. Tuning Lucene Indexing
5.5.1. Tuning Lucene Indexing Performance
Hibernate Search lets you tune Lucene indexing performance by specifying a set of parameters that are passed through to the underlying Lucene IndexWriter, such as mergeFactor, maxMergeDocs, and maxBufferedDocs. Specify these parameters as default values that apply to all indexes, on a per-index basis, or even per shard.
There are several low-level IndexWriter settings which can be tuned for different use cases. These parameters are grouped by the indexwriter keyword:
hibernate.search.[default|<indexname>].indexwriter.<parameter_name>
If no value is set for an indexwriter parameter in a specific shard configuration, Hibernate Search checks the index section, then the default section.
With the configuration shown in Example 5.6, “Example performance option configuration”, the following settings are applied to the second shard of the Animal index:
- max_merge_docs = 10
- merge_factor = 20
- ram_buffer_size = 64MB
- term_index_interval = Lucene default
All other values use the defaults defined in Lucene.
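The lookup order described above can be sketched with a small configuration. This is only an illustration: the index name Books and the shard number 1 are hypothetical, and the values are arbitrary rather than recommendations.

```properties
# Hypothetical index "Books", shard 1 (names and values are assumptions for illustration).
# Default section: applies to every index unless overridden.
hibernate.search.default.indexwriter.ram_buffer_size = 32
hibernate.search.default.indexwriter.merge_factor = 10
# Index section: overrides the default section for all shards of Books.
hibernate.search.Books.indexwriter.merge_factor = 15
# Shard section: overrides the index section for shard 1 only.
hibernate.search.Books.1.indexwriter.max_merge_docs = 1000
```

Under the lookup order described above, shard 1 of Books would be expected to use max_merge_docs = 1000 (shard section), merge_factor = 15 (index section), and ram_buffer_size = 32 (default section). Parameters not set at any level stay at the Lucene default.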
By default, every value is left at Lucene's own default. For this reason, the values listed in Table 5.6, “List of indexing performance and behavior properties” depend on the version of Lucene you are using. The values shown are relative to version 2.4. For more information about Lucene indexing performance, see the Lucene documentation.
Note
Previous versions of Search had the notion of batch and transaction properties. This is no longer the case, as the backend always performs work using the same settings.
Table 5.6. List of indexing performance and behavior properties
| Property | Description | Default Value |
|---|---|---|
| hibernate.search.[default\|<indexname>].exclusive_index_use | Set to true when no other process will need to write to the same index. This enables Hibernate Search to work in exclusive mode on the index and improve performance when writing changes to the index. | true (improved performance, releases locks only at shutdown) |
| hibernate.search.[default\|<indexname>].max_queue_length | Each index has a separate "pipeline" which contains the updates to be applied to the index. When this queue is full, adding more operations to the queue becomes a blocking operation. Configuring this setting does not make much sense unless worker.execution is configured as async. | 1000 |
| hibernate.search.[default\|<indexname>].indexwriter.max_buffered_delete_terms | Determines the minimum number of delete terms required before the buffered in-memory delete terms are applied and flushed. If there are documents buffered in memory at the time, they are merged and a new segment is created. | Disabled (flushes by RAM usage) |
| hibernate.search.[default\|<indexname>].indexwriter.max_buffered_docs | Controls the number of documents buffered in memory during indexing. The larger the value, the more RAM is consumed. | Disabled (flushes by RAM usage) |
| hibernate.search.[default\|<indexname>].indexwriter.max_merge_docs | Defines the largest number of documents allowed in a segment. Smaller values perform better on frequently changing indexes; larger values provide better search performance if the index does not change often. | Unlimited (Integer.MAX_VALUE) |
| hibernate.search.[default\|<indexname>].indexwriter.merge_factor | Controls segment merge frequency and size. Determines how often segment indexes are merged when insertion occurs. With smaller values, less RAM is used while indexing, and searches on unoptimized indexes are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indexes are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indexes that are interactively maintained. The value must not be lower than 2. | 10 |
| hibernate.search.[default\|<indexname>].indexwriter.merge_min_size | Controls segment merge frequency and size. Segments smaller than this size (in MB) are always considered for the next segment merge operation. Setting this too large might result in expensive merge operations, even though they are less frequent. See also org.apache.lucene.index.LogDocMergePolicy.minMergeSize. | 0 MB (actually ~1K) |
| hibernate.search.[default\|<indexname>].indexwriter.merge_max_size | Controls segment merge frequency and size. Segments larger than this size (in MB) are never merged into bigger segments. This helps reduce memory requirements and avoids some merging operations at the cost of optimal search speed. When optimizing an index this value is ignored. See also org.apache.lucene.index.LogDocMergePolicy.maxMergeSize. | Unlimited |
| hibernate.search.[default\|<indexname>].indexwriter.merge_max_optimize_size | Controls segment merge frequency and size. Segments larger than this size (in MB) are not merged into bigger segments even when optimizing the index (see the merge_max_size setting as well). Applied to org.apache.lucene.index.LogDocMergePolicy.maxMergeSizeForOptimize. | Unlimited |
| hibernate.search.[default\|<indexname>].indexwriter.merge_calibrate_by_deletes | Controls segment merge frequency and size. Set to false to not consider deleted documents when estimating the merge policy. Applied to org.apache.lucene.index.LogMergePolicy.calibrateSizeByDeletes. | true |
| hibernate.search.[default\|<indexname>].indexwriter.ram_buffer_size | Controls the amount of RAM in MB dedicated to document buffers. When used together with max_buffered_docs, a flush occurs for whichever event happens first. Generally, for faster indexing performance it is best to flush by RAM usage instead of document count and to use as large a RAM buffer as you can. | 16 MB |
| hibernate.search.[default\|<indexname>].indexwriter.term_index_interval | Expert: Set the interval between indexed terms. Large values cause less memory to be used by an IndexReader, but slow random access to terms. Small values cause more memory to be used by an IndexReader, and speed random access to terms. See Lucene documentation for more details. | 128 |
| hibernate.search.[default\|<indexname>].indexwriter.use_compound_file | The advantage of using the compound file format is that fewer file descriptors are used. The disadvantage is that indexing takes more time and temporary disk space. You can set this parameter to false in an attempt to improve the indexing time, but you could run out of file descriptors if mergeFactor is also large. Boolean parameter, use "true" or "false". The default value for this option is true. | true |
| hibernate.search.enable_dirty_check | Not all entity changes require a Lucene index update. If none of the updated entity properties (dirty properties) are indexed, Hibernate Search skips the re-indexing process. Disable this option if you use custom FieldBridges which need to be invoked at each update event (even though the property for which the field bridge is configured has not changed). This optimization will not be applied on classes using a @ClassBridge or a @DynamicBoost. Boolean parameter, use "true" or "false". The default value for this option is true. | true |
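As a hedged illustration of the max_queue_length row, the fragment below pairs it with asynchronous execution, since the table notes that the queue length matters mainly in that mode. The full property name for worker.execution is written here in the same hibernate.search.default form as the other rows, and the value 5000 is an arbitrary assumption, not a recommendation.

```properties
# Assumption: index updates are applied asynchronously for all indexes.
hibernate.search.default.worker.execution = async
# Allow more pending operations per index before producers block (default is 1000).
hibernate.search.default.max_queue_length = 5000
```

A longer queue can absorb bursts of updates, at the cost of more memory held by pending work.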
5.5.2. The Lucene IndexWriter
There are several low-level IndexWriter settings which can be tuned for different use cases. These parameters are grouped by the indexwriter keyword:
default.<indexname>.indexwriter.<parameter_name>
If no value is set for an indexwriter parameter in a shard configuration, Hibernate Search looks at the index section and then at the default section.
5.5.3. Performance Option Configuration
The following configuration will result in these settings being applied on the second shard of the
Animal index:
Example 5.6. Example performance option configuration
default.Animals.2.indexwriter.max_merge_docs = 10
default.Animals.2.indexwriter.merge_factor = 20
default.Animals.2.indexwriter.term_index_interval = default
default.indexwriter.max_merge_docs = 100
default.indexwriter.ram_buffer_size = 64
- max_merge_docs = 10
- merge_factor = 20
- ram_buffer_size = 64MB
- term_index_interval = Lucene default
All other values will use the defaults defined in Lucene.
Hibernate Search leaves every value at the Lucene default. Therefore, the values listed in the following table depend on the version of Lucene being used. The values shown are relative to version 2.4. For more information about Lucene indexing performance, see the Lucene documentation.
Note
The back end always performs work using the same settings. There are no separate batch and transaction properties, as there were in previous versions of Search.
Table 5.7. List of indexing performance and behavior properties
| Property | Description | Default Value |
|---|---|---|
| default.<indexname>.exclusive_index_use | Set to true when no other process will need to write to the same index. This enables Hibernate Search to work in exclusive mode on the index and improve performance when writing changes to the index. | true (improved performance, releases locks only at shutdown) |
| default.<indexname>.max_queue_length | Each index has a separate "pipeline" which contains the updates to be applied to the index. When this queue is full, adding more operations to the queue becomes a blocking operation. Configuring this setting does not make much sense unless worker.execution is configured as async. | 1000 |
| default.<indexname>.indexwriter.max_buffered_delete_terms | Determines the minimum number of delete terms required before the buffered in-memory delete terms are applied and flushed. If there are documents buffered in memory at the time, they are merged and a new segment is created. | Disabled (flushes by RAM usage) |
| default.<indexname>.indexwriter.max_buffered_docs | Controls the number of documents buffered in memory during indexing. The larger the value, the more RAM is consumed. | Disabled (flushes by RAM usage) |
| default.<indexname>.indexwriter.max_merge_docs | Defines the largest number of documents allowed in a segment. Smaller values perform better on frequently changing indexes; larger values provide better search performance if the index does not change often. | Unlimited (Integer.MAX_VALUE) |
| default.<indexname>.indexwriter.merge_factor | Controls segment merge frequency and size. Determines how often segment indexes are merged when insertion occurs. With smaller values, less RAM is used while indexing, and searches on unoptimized indexes are faster, but indexing speed is slower. With larger values, more RAM is used during indexing, and while searches on unoptimized indexes are slower, indexing is faster. Thus larger values (> 10) are best for batch index creation, and smaller values (< 10) for indexes that are interactively maintained. The value must not be lower than 2. | 10 |
| default.<indexname>.indexwriter.merge_min_size | Controls segment merge frequency and size. Segments smaller than this size (in MB) are always considered for the next segment merge operation. Setting this too large might result in expensive merge operations, even though they are less frequent. See also org.apache.lucene.index.LogDocMergePolicy.minMergeSize. | 0 MB (actually ~1K) |
| default.<indexname>.indexwriter.merge_max_size | Controls segment merge frequency and size. Segments larger than this size (in MB) are never merged into bigger segments. This helps reduce memory requirements and avoids some merging operations at the cost of optimal search speed. When optimizing an index this value is ignored. See also org.apache.lucene.index.LogDocMergePolicy.maxMergeSize. | Unlimited |
| default.<indexname>.indexwriter.merge_max_optimize_size | Controls segment merge frequency and size. Segments larger than this size (in MB) are not merged into bigger segments even when optimizing the index (see the merge_max_size setting as well). Applied to org.apache.lucene.index.LogDocMergePolicy.maxMergeSizeForOptimize. | Unlimited |
| default.<indexname>.indexwriter.merge_calibrate_by_deletes | Controls segment merge frequency and size. Set to false to not consider deleted documents when estimating the merge policy. Applied to org.apache.lucene.index.LogMergePolicy.calibrateSizeByDeletes. | true |
| default.<indexname>.indexwriter.ram_buffer_size | Controls the amount of RAM in MB dedicated to document buffers. When used together with max_buffered_docs, a flush occurs for whichever event happens first. Generally, for faster indexing performance it is best to flush by RAM usage instead of document count and to use as large a RAM buffer as you can. | 16 MB |
| default.<indexname>.indexwriter.term_index_interval | Expert: Set the interval between indexed terms. Large values cause less memory to be used by an IndexReader, but slow random access to terms. Small values cause more memory to be used by an IndexReader, and speed random access to terms. See Lucene documentation for more details. | 128 |
| default.<indexname>.indexwriter.use_compound_file | The advantage of using the compound file format is that fewer file descriptors are used. The disadvantage is that indexing takes more time and temporary disk space. You can set this parameter to false in an attempt to improve the indexing time, but you could run out of file descriptors if mergeFactor is also large. Boolean parameter, use "true" or "false". The default value for this option is true. | true |
| default.enable_dirty_check | Not all entity changes require a Lucene index update. If none of the updated entity properties (dirty properties) are indexed, Hibernate Search skips the re-indexing process. Disable this option if you use custom FieldBridges which need to be invoked at each update event (even though the property for which the field bridge is configured has not changed). This optimization will not be applied on classes using a @ClassBridge or a @DynamicBoost. Boolean parameter, use "true" or "false". The default value for this option is true. | true |
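The merge_factor and ram_buffer_size rows suggest different settings for batch index creation and for interactively maintained indexes. A minimal sketch of a batch-oriented profile follows, written in the property form used by Example 5.6. The specific values are assumptions chosen only to illustrate the direction of the trade-off.

```properties
# Batch index creation (illustrative values, not recommendations).
# A merge factor above 10 favors indexing speed over search speed on unoptimized indexes.
default.indexwriter.merge_factor = 30
# Flush by RAM usage with a larger buffer than the 16 MB default.
default.indexwriter.ram_buffer_size = 64
```

For an index that is updated interactively, the table suggests the opposite direction: a merge_factor below 10, but not lower than 2.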
5.5.4. Tuning the Indexing Speed
When the architecture permits it, keep
default.exclusive_index_use=true for improved index writing efficiency.
To tune the indexing speed, time the object loading from the database in isolation from the writes to the index. To do so, set blackhole as the worker back end and start your indexing routines. This back end does not disable Hibernate Search: it generates the required change sets to the index, but discards them instead of flushing them to the index. In contrast to setting hibernate.search.indexing_strategy to manual, using blackhole may load more data from the database, because associated entities are re-indexed as well.
hibernate.search.[default|<indexname>].worker.backend blackhole
The recommended approach is to focus first on optimizing the object loading, and then use the timings you achieve as a baseline to tune the indexing process.
Warning
The
blackhole back end is not meant to be used in production, only as a tool to identify indexing bottlenecks.
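A sketch of how the two timing runs might be configured, using the default section so that the setting applies to all indexes. The value lucene for the regular back end is an assumption about the deployment; substitute whichever back end the application normally uses.

```properties
# Run 1: measure object loading only. Index changes are generated, then discarded.
hibernate.search.default.worker.backend = blackhole

# Run 2: restore the regular back end (assumed here to be the local Lucene back end)
# and repeat the same indexing routine to measure the cost of index writes.
# hibernate.search.default.worker.backend = lucene
```

The difference between the two timings gives a rough estimate of the time spent writing to the index.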
5.5.5. Control Segment Size
The following options configure the maximum size of segments created:
- merge_max_size
- merge_max_optimize_size
- merge_calibrate_by_deletes
Example:
//to be fairly confident no files grow above 15MB, use:
hibernate.search.default.indexwriter.ram_buffer_size = 10
hibernate.search.default.indexwriter.merge_max_optimize_size = 7
hibernate.search.default.indexwriter.merge_max_size = 7
Set the max_size options for merge operations (merge_max_size and merge_max_optimize_size) to less than half of the hard limit segment size, as merging segments combines two segments into one larger segment.
A new segment may initially be larger than expected; however, a segment is never created significantly larger than the ram_buffer_size. This threshold is checked as an estimate.
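The same rule of thumb can be applied to a different limit. As a worked sketch, assume a hypothetical hard limit of 50 MB per segment: merging two segments must stay below 50 MB, so each mergeable segment should be under 25 MB, and the RAM buffer should sit comfortably below the limit because the threshold is only an estimate. The values below are assumptions derived from that reasoning, not tested recommendations.

```properties
# Hypothetical target: no segment larger than about 50 MB.
# New segments are roughly bounded by the RAM buffer, so keep it well under 50.
hibernate.search.default.indexwriter.ram_buffer_size = 30
# Two merged segments of at most 24 MB each yield at most 48 MB.
hibernate.search.default.indexwriter.merge_max_size = 24
hibernate.search.default.indexwriter.merge_max_optimize_size = 24
```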