8.3. Rebuilding the Index
If you change the entity mapping for the index, chances are the whole index needs to be updated; for example, if you decide to index an existing field using a different analyzer you'll need to rebuild the index for the affected types. Likewise, if the database is replaced (for example restored from a backup or imported from a legacy system) you'll want to be able to rebuild the index from the existing data. Hibernate Search provides two main strategies to choose from:
- Using FullTextSession.flushToIndexes() periodically, while using FullTextSession.index() on all entities.
- Using a MassIndexer.
8.3.1. Using flushToIndexes()
This strategy consists of removing the existing index and then adding all entities back to the index using FullTextSession.purgeAll() and FullTextSession.index(); however, there are some memory and efficiency constraints. For maximum efficiency Hibernate Search batches index operations and executes them at commit time. If you expect to index a lot of data you need to be careful about memory consumption, since all documents are kept in a queue until the transaction commits. You can potentially face an OutOfMemoryException if you don't empty the queue periodically; to do this, call fullTextSession.flushToIndexes(). Every time fullTextSession.flushToIndexes() is called (or the transaction is committed), the batch queue is processed and all index changes are applied. Be aware that, once flushed, the changes cannot be rolled back.
Example 8.4. Index rebuilding using index() and flushToIndexes()
fullTextSession.setFlushMode(FlushMode.MANUAL);
fullTextSession.setCacheMode(CacheMode.IGNORE);
transaction = fullTextSession.beginTransaction();
//Scrollable results will avoid loading too many objects in memory
ScrollableResults results = fullTextSession.createCriteria( Email.class )
    .setFetchSize(BATCH_SIZE)
    .scroll( ScrollMode.FORWARD_ONLY );
int index = 0;
while( results.next() ) {
    index++;
    fullTextSession.index( results.get(0) ); //index each element
    if (index % BATCH_SIZE == 0) {
        fullTextSession.flushToIndexes(); //apply changes to indexes
        fullTextSession.clear(); //free memory since the queue is processed
    }
}
transaction.commit();
Note
hibernate.search.default.worker.batch_size has been deprecated in favor of this explicit API, which provides better control.
Try to use a batch size that guarantees that your application will not run out of memory: with a bigger batch size objects are fetched faster from the database, but more memory is needed.
8.3.2. Using a MassIndexer
Hibernate Search's MassIndexer uses several parallel threads to rebuild the index; you can optionally select which entities need to be reloaded or have it reindex all entities. This approach is optimized for best performance but requires putting the application in maintenance mode: making queries against the index is not recommended while a MassIndexer is busy.
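In its simplest form, the MassIndexer can be started with a single call. The sketch below assumes an open FullTextSession named fullTextSession; calling createIndexer() with no arguments targets all indexed entities:

```java
// Minimal sketch: rebuild the index for all indexed entity types,
// using the MassIndexer's default settings.
// startAndWait() blocks the current thread until indexing completes.
fullTextSession
    .createIndexer()
    .startAndWait();
```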
This will rebuild the index, deleting it and then reloading all entities from the database. Although it is simple to use, some tweaking is recommended to speed up the process: several parameters are configurable.
Warning
While a MassIndexer is running the content of the index is undefined! If a query is performed while the MassIndexer is working, it is likely that some results will be missing.
Example 8.6. Using a Tuned MassIndexer
fullTextSession
    .createIndexer( User.class )
    .batchSizeToLoadObjects( 25 )
    .cacheMode( CacheMode.NORMAL )
    .threadsToLoadObjects( 12 )
    .idFetchSize( 150 )
    .progressMonitor( monitor ) //a MassIndexerProgressMonitor implementation
    .startAndWait();
This will rebuild the index of all User instances (and subtypes), creating 12 parallel threads to load the User instances in batches of 25 objects per query. These same 12 threads also process indexed embedded relations and custom FieldBridges or ClassBridges to produce a Lucene document, and they trigger lazy loading of additional attributes during the conversion process. Because of this, a high number of threads working in parallel is required. The number of threads working on actual index writing is defined by the backend configuration of each index.
Generally we suggest leaving cacheMode set to CacheMode.IGNORE (the default), as in most reindexing situations the cache will be useless additional overhead; however, enabling some other CacheMode might be useful depending on your data: it could increase performance if the main entity relates to enum-like data included in the index.
Note
The ideal number of threads to achieve best performance is highly dependent on your overall architecture, database design and data values. All internal thread groups have meaningful names, so they should be easy to identify with most diagnostic tools, including thread dumps.
Note
The MassIndexer is unaware of transactions, so there is no need to begin or commit one. Also, because it is not transactional, it is not recommended to let users use the system during its processing, as it is unlikely they will be able to find results and the system load might be too high anyway.
Other parameters which affect indexing time and memory consumption are:
- hibernate.search.[default|<indexname>].exclusive_index_use
- hibernate.search.[default|<indexname>].indexwriter.max_buffered_docs
- hibernate.search.[default|<indexname>].indexwriter.max_merge_docs
- hibernate.search.[default|<indexname>].indexwriter.merge_factor
- hibernate.search.[default|<indexname>].indexwriter.merge_min_size
- hibernate.search.[default|<indexname>].indexwriter.merge_max_size
- hibernate.search.[default|<indexname>].indexwriter.merge_max_optimize_size
- hibernate.search.[default|<indexname>].indexwriter.merge_calibrate_by_deletes
- hibernate.search.[default|<indexname>].indexwriter.ram_buffer_size
- hibernate.search.[default|<indexname>].indexwriter.term_index_interval
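As a sketch, these parameters are typically set in hibernate.properties (or the equivalent configuration source of your application); the values below are illustrative assumptions only, not recommendations, and should be tuned against your own data set:

```properties
# Illustrative values only — tune for your own data and hardware.
hibernate.search.default.exclusive_index_use = true
hibernate.search.default.indexwriter.max_buffered_docs = 100
hibernate.search.default.indexwriter.merge_factor = 20
hibernate.search.default.indexwriter.ram_buffer_size = 64
```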
Previous versions also had a max_field_length parameter, but this was removed from Lucene; it is possible to obtain a similar effect by using a LimitTokenCountAnalyzer.
All .indexwriter parameters are Lucene specific and Hibernate Search simply passes them through - see Section 5.5.1, “Tuning Lucene Indexing Performance” for more details.
The MassIndexer uses a forward-only scrollable result to iterate over the primary keys to be loaded, but MySQL's JDBC driver will load all values in memory; to avoid this "optimization" set idFetchSize to Integer.MIN_VALUE.