6.3. Analysis
Analysis is the process of converting text into single terms (words) and can be considered as one of the key features of a full-text search engine. Lucene uses the concept of Analyzers to control this process. In the following section we cover the multiple ways Hibernate Search offers to configure the analyzers.
6.3.1. Default Analyzer and Analyzer by Class
The default analyzer class used to index tokenized fields is configurable through the
hibernate.search.analyzer property. The default value for this property is org.apache.lucene.analysis.standard.StandardAnalyzer.
You can also define the analyzer class per entity, property and even per @Field (useful when multiple fields are indexed from a single property).
Example 6.12. Different ways of using @Analyzer
@Entity @Indexed @Analyzer(impl = EntityAnalyzer.class) public class MyEntity { @Id @GeneratedValue @DocumentId private Integer id; @Field private String name; @Field @Analyzer(impl = PropertyAnalyzer.class) private String summary; @Field( analyzer = @Analyzer(impl = FieldAnalyzer.class ) private String body; ... }
In this example,
EntityAnalyzer is used to index all tokenized properties (example, name), except summary and body which are indexed with PropertyAnalyzer and FieldAnalyzer respectively.
Warning
Mixing different analyzers in the same entity is most of the time a bad practice. It makes query building more complex and results less predictable (for the novice), especially if you are using a QueryParser (which uses the same analyzer for the whole query). As a rule of thumb, for any given field the same analyzer should be used for indexing and querying.
6.3.2. Named Analyzers
Analyzers can become quite complex to deal with. For this reason introduces Hibernate Search the notion of analyzer definitions. An analyzer definition can be reused by many
@Analyzer declarations and is composed of:
- a name: the unique string used to refer to the definition
- a list of char filters: each char filter is responsible to pre-process input characters before the tokenization. Char filters can add, change, or remove characters; one common usage is for characters normalization
- a tokenizer: responsible for tokenizing the input stream into individual words
- a list of filters: each filter is responsible to remove, modify, or sometimes even add words into the stream provided by the tokenizer
This separation of tasks - a list of char filters, and a tokenizer followed by a list of filters - allows for easy reuse of each individual component and let you build your customized analyzer in a very flexible way (like Lego). Generally speaking the char filters do some pre-processing in the character input, then the
Tokenizer starts the tokenizing process by turning the character input into tokens which are then further processed by the TokenFilters. Hibernate Search supports this infrastructure by utilizing the Solr analyzer framework.
Note
Some of the analyzers and filters will require additional dependencies. For example to use the snowball stemmer you have to also include the
lucene-snowball jar and for the PhoneticFilterFactory you need the commons-codec jar. Your distribution of Hibernate Search provides these dependencies in its lib/optional directory.
When using Maven all required Solr dependencies are now defined as dependencies of the artifact org.hibernate:hibernate-search-analyzers; add the following dependency :
<dependency> <groupId>org.hibernate</groupId> <artifactId>hibernate-search-analyzers</artifactId> <version>4.4.4.Final-redhat-wfk-1</version> <dependency>
Let's review a concrete example now - Example 6.13, “
@AnalyzerDef and the Solr framework”. First a char filter is defined by its factory. In our example, a mapping char filter is used, and will replace characters in the input based on the rules specified in the mapping file. Next a tokenizer is defined. This example uses the standard tokenizer. Last but not least, a list of filters is defined by their factories. In our example, the StopFilter filter is built reading the dedicated words property file. The filter is also expected to ignore case.
Example 6.13. @AnalyzerDef and the Solr framework
@AnalyzerDef(name="customanalyzer", charFilters = { @CharFilterDef(factory = MappingCharFilterFactory.class, params = { @Parameter(name = "mapping", value = "org/hibernate/search/test/analyzer/solr/mapping-chars.properties") }) }, tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class), @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = StopFilterFactory.class, params = { @Parameter(name="words", value= "org/hibernate/search/test/analyzer/solr/stoplist.properties" ), @Parameter(name="ignoreCase", value="true") }) }) public class Team { ... }
Note
Filters and char filters are applied in the order they are defined in the
@AnalyzerDef annotation. Order matters!
Some tokenizers, token filters or char filters load resources like a configuration or metadata file. This is the case for the stop filter and the synonym filter. If the resource charset is not using the VM default, you can explicitly specify it by adding a
resource_charset parameter.
Example 6.14. Use a specific charset to load the property file
@AnalyzerDef(name="customanalyzer", charFilters = { @CharFilterDef(factory = MappingCharFilterFactory.class, params = { @Parameter(name = "mapping", value = "org/hibernate/search/test/analyzer/solr/mapping-chars.properties") }) }, tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class), @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = StopFilterFactory.class, params = { @Parameter(name="words", value= "org/hibernate/search/test/analyzer/solr/stoplist.properties" ), @Parameter(name="resource_charset", value = "UTF-16BE"), @Parameter(name="ignoreCase", value="true") }) }) public class Team { ... }
Once defined, an analyzer definition can be reused by an
@Analyzer declaration as seen in Example 6.15, “Referencing an analyzer by name”.
Example 6.15. Referencing an analyzer by name
@Entity @Indexed @AnalyzerDef(name="customanalyzer", ... ) public class Team { @Id @DocumentId @GeneratedValue private Integer id; @Field private String name; @Field private String location; @Field @Analyzer(definition = "customanalyzer") private String description; }
Analyzer instances declared by
@AnalyzerDef are also available by their name in the SearchFactory which is quite useful wen building queries.
Analyzer analyzer = fullTextSession.getSearchFactory().getAnalyzer("customanalyzer");
Fields in queries must be analyzed with the same analyzer used to index the field so that they speak a common "language": the same tokens are reused between the query and the indexing process. This rule has some exceptions but is true most of the time. Respect it unless you know what you are doing.
6.3.3. Available Analyzers
Solr and Lucene come with a lot of useful default char filters, tokenizers, and filters. You can find a complete list of char filter factories, tokenizer factories and filter factories at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. Let's check a few of them.
Table 6.1. Example of available char filters
| Factory | Description | Parameters | Additional dependencies |
|---|---|---|---|
MappingCharFilterFactory | Replaces one or more characters with one or more characters, based on mappings specified in the resource file | mapping: points to a resource file containing the mappings using the format:
| none |
HTMLStripCharFilterFactory | Remove HTML standard tags, keeping the text | none | none |
Table 6.2. Example of available tokenizers
| Factory | Description | Parameters | Additional dependencies |
|---|---|---|---|
StandardTokenizerFactory | Use the Lucene StandardTokenizer | none | none |
HTMLStripCharFilterFactory | Remove HTML tags, keep the text and pass it to a StandardTokenizer. | none | solr-core |
PatternTokenizerFactory | Breaks text at the specified regular expression pattern. | pattern: the regular expression to use for tokenizing
group: says which pattern group to extract into tokens
| solr-core |
Table 6.3. Examples of available filters
| Factory | Description | Parameters | Additional dependencies |
|---|---|---|---|
StandardFilterFactory | Remove dots from acronyms and 's from words | none | solr-core |
LowerCaseFilterFactory | Lowercases all words | none | solr-core |
StopFilterFactory | Remove words (tokens) matching a list of stop words | words: points to a resource file containing the stop words
ignoreCase: true if
case should be ignored when comparing stop words, false otherwise
| solr-core |
SnowballPorterFilterFactory | Reduces a word to it's root in a given language. (example: protect, protects, protection share the same root). Using such a filter allows searches matching related words. | language: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and a few more | solr-core |
ISOLatin1AccentFilterFactory | Remove accents for languages like French | none | solr-core |
PhoneticFilterFactory | Inserts phonetically similar tokens into the token stream | encoder: One of DoubleMetaphone, Metaphone, Soundex or RefinedSoundex
inject:
true will add tokens to the stream, false will replace the existing token
maxCodeLength: sets the maximum length of the code to be generated. Supported only for Metaphone and DoubleMetaphone encodings
| solr-core and commons-codec |
CollationKeyFilterFactory | Converts each token into its java.text.CollationKey, and then encodes the CollationKey with IndexableBinaryStringTools, to allow it to be stored as an index term. | custom, language, country, variant, strength, decompositionsee Lucene's CollationKeyFilter javadocs for more info | solr-core and commons-io |
We recommend to check all the implementations of
org.apache.solr.analysis.TokenizerFactory and org.apache.solr.analysis.TokenFilterFactory in your IDE to see the implementations available.
6.3.4. Dynamic Analyzer Selection
So far all the introduced ways to specify an analyzer were static. However, there are use cases where it is useful to select an analyzer depending on the current state of the entity to be indexed, for example in a multilingual applications. For an
BlogEntry class for example the analyzer could depend on the language property of the entry. Depending on this property the correct language specific stemmer should be chosen to index the actual text.
To enable this dynamic analyzer selection Hibernate Search introduces the
AnalyzerDiscriminator annotation. Example 6.16, “Usage of @AnalyzerDiscriminator” demonstrates the usage of this annotation.
Example 6.16. Usage of @AnalyzerDiscriminator
@Entity @Indexed @AnalyzerDefs({ @AnalyzerDef(name = "en", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = EnglishPorterFilterFactory.class ) }), @AnalyzerDef(name = "de", tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class), filters = { @TokenFilterDef(factory = LowerCaseFilterFactory.class), @TokenFilterDef(factory = GermanStemFilterFactory.class) }) }) public class BlogEntry { @Id @GeneratedValue @DocumentId private Integer id; @Field @AnalyzerDiscriminator(impl = LanguageDiscriminator.class) private String language; @Field private String text; private Set<BlogEntry> references; // standard getter/setter ... }
public class LanguageDiscriminator implements Discriminator { public String getAnalyzerDefinitionName(Object value, Object entity, String field) { if ( value == null || !( entity instanceof Article ) ) { return null; } return (String) value; } }
The prerequisite for using
@AnalyzerDiscriminator is that all analyzers which are going to be used dynamically are predefined via @AnalyzerDef definitions. If this is the case, one can place the @AnalyzerDiscriminator annotation either on the class or on a specific property of the entity for which to dynamically select an analyzer. Via the impl parameter of the AnalyzerDiscriminator you specify a concrete implementation of the Discriminator interface. It is up to you to provide an implementation for this interface. The only method you have to implement is getAnalyzerDefinitionName() which gets called for each field added to the Lucene document. The entity which is getting indexed is also passed to the interface method. The value parameter is only set if the AnalyzerDiscriminator is placed on property level instead of class level. In this case the value represents the current value of this property.
An implementation of the
Discriminator interface has to return the name of an existing analyzer definition or null if the default analyzer should not be overridden. Example 6.16, “Usage of @AnalyzerDiscriminator” assumes that the language parameter is either 'de' or 'en' which matches the specified names in the @AnalyzerDefs.
6.3.5. Retrieving an Analyzer
Retrieving an analyzer can be used when multiple analyzers have been used in a domain model, in order to benefit from stemming or phonetic approximation, etc. In this case, use the same analyzers to building a query. Alternatively, use the Hibernate Search query DSL, which selects the correct analyzer automatically. See Section 7.1.2, “Building a Lucene Query”
Whether you are using the Lucene programmatic API or the Lucene query parser, you can retrieve the scoped analyzer for a given entity. A scoped analyzer is an analyzer which applies the right analyzers depending on the field indexed. Remember, multiple analyzers can be defined on a given entity each one working on an individual field. A scoped analyzer unifies all these analyzers into a context-aware analyzer. While the theory seems a bit complex, using the right analyzer in a query is very easy.
Note
When you use programmatic mapping for a child entity, you can only see the fields defined by the child entity. Fields or methods inherited from a parent entity (annotated with @MappedSuperclass) are not configurable. To configure properties inherited from a parent entity, either override the property in the child entity or create a programmatic mapping for the parent entity. This mimics the usage of annotations where you cannot annotate a field or method of a parent entity unless it is redefined in the child entity.
Example 6.17. Using the scoped analyzer when building a full-text query
org.apache.lucene.queryParser.QueryParser parser = new QueryParser( "title", fullTextSession.getSearchFactory().getAnalyzer( Song.class ) ); org.apache.lucene.search.Query luceneQuery = parser.parse( "title:sky Or title_stemmed:diamond" ); org.hibernate.Query fullTextQuery = fullTextSession.createFullTextQuery( luceneQuery, Song.class ); List result = fullTextQuery.list(); //return a list of managed objects
In the example above, the song title is indexed in two fields: the standard analyzer is used in the field
title and a stemming analyzer is used in the field title_stemmed. By using the analyzer provided by the search factory, the query uses the appropriate analyzer depending on the field targeted.
Note
You can also retrieve analyzers defined via
@AnalyzerDef by their definition name using searchFactory.getAnalyzer(String).