14.5. Analysis

In the Query Module, the process of converting text into single terms is called Analysis and is a key feature of the full-text search engine. Lucene uses Analyzers to control this process.

14.5.1. Default Analyzer and Analyzer by Class

The default analyzer class is used to index tokenized fields, and is configurable through the default.analyzer property. The default value for this property is org.apache.lucene.analysis.standard.StandardAnalyzer.

The analyzer class can be defined per entity, per property, and per @Field. The last option is useful when multiple fields are indexed from a single property.

In the following example, EntityAnalyzer is used to index all tokenized properties, such as name, except summary and body, which are indexed with PropertyAnalyzer and FieldAnalyzer respectively.

Example 14.9. Different ways of using @Analyzer

@Indexed
@Analyzer(impl = EntityAnalyzer.class)
public class MyEntity {

    @Field
    private String name;

    @Field
    @Analyzer(impl = PropertyAnalyzer.class)
    private String summary;

    @Field(analyzer = @Analyzer(impl = FieldAnalyzer.class))
    private String body;
}

Note

Avoid using different analyzers on a single entity. Doing so can create complications in building queries, and make results less predictable, particularly if using a QueryParser. Use the same analyzer for indexing and querying on any field.

14.5.2. Named Analyzers

The Query Module uses analyzer definitions to deal with the complexity of the Analyzer function. Analyzer definitions are reusable by multiple @Analyzer declarations and include the following:

a name: the unique string used to refer to the definition.
a list of CharFilters: each CharFilter is responsible for pre-processing input characters before tokenization. CharFilters can add, change, or remove characters. One common usage is character normalization.
a Tokenizer: responsible for tokenizing the input stream into individual words.
a list of filters: each filter is responsible for removing, modifying, or sometimes adding words in the stream provided by the Tokenizer.

Separating analysis into these tasks allows individual components to be reused and analyzers to be assembled flexibly. The components are applied in the following order:

Procedure 14.1. The Analyzer Process

The CharFilters process the character input.
The Tokenizer converts the character input into tokens.
The tokens are then processed by the TokenFilters.

The Lucene-based Query API supports this infrastructure by utilizing the Solr analyzer framework.
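
The three stages of Procedure 14.1 can be illustrated with a minimal plain-Java sketch. It does not use the Lucene or Solr classes; the class name AnalysisPipelineSketch and its methods are hypothetical and exist only to show how a character filter, a tokenizer, and token filters are chained.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of the three analysis stages; NOT the Lucene/Solr API.
class AnalysisPipelineSketch {

    // Stage 1: a char filter rewrites raw characters before tokenization.
    static String charFilter(String input) {
        return input
            .replace('\u00e1', 'a')   // a with acute accent -> a
            .replace('\u00f1', 'n')   // n with tilde -> n
            .replace('\u00f8', 'o');  // o with stroke -> o
    }

    // Stage 2: a tokenizer splits the filtered characters into tokens.
    static List<String> tokenize(String input) {
        List<String> tokens = new ArrayList<>();
        for (String piece : input.split("[^\\p{L}\\p{Nd}]+")) {
            if (!piece.isEmpty()) {
                tokens.add(piece);
            }
        }
        return tokens;
    }

    // Stage 3: token filters modify or remove tokens (lowercase, then stop words).
    static List<String> tokenFilters(List<String> tokens, Set<String> stopWords) {
        List<String> out = new ArrayList<>();
        for (String token : tokens) {
            String lowered = token.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(lowered)) {
                out.add(lowered);
            }
        }
        return out;
    }

    static List<String> analyze(String input, Set<String> stopWords) {
        return tokenFilters(tokenize(charFilter(input)), stopWords);
    }

    public static void main(String[] args) {
        // Prints [senor, fjord]
        System.out.println(analyze("The Se\u00f1or and the Fj\u00f8rd", Set.of("the", "and")));
    }
}
```

Because each stage only consumes the output of the previous one, any stage can be swapped without touching the others, which is the reuse that analyzer definitions aim for.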

14.5.3. Analyzer Definitions

Once defined, an analyzer definition can be reused by an @Analyzer annotation.

Example 14.10. Referencing an analyzer by name

@Indexed
@AnalyzerDef(name = "customanalyzer")
public class Team {

    @Field
    private String name;

    @Field
    private String location;

    @Field 
    @Analyzer(definition = "customanalyzer")
    private String description;
}

Analyzer instances declared by @AnalyzerDef are also available by their name in the SearchFactory, which is useful when building queries.

Analyzer analyzer = Search.getSearchManager(cache).getSearchFactory().getAnalyzer("customanalyzer");

When querying, a field must be analyzed with the same analyzer that was used to index it, so that the query and the indexing process produce the same tokens.

14.5.4. @AnalyzerDef for Solr

When using Maven, all required Apache Solr dependencies are defined as dependencies of the artifact org.hibernate:hibernate-search-analyzers. Add the following dependency:

<dependency>
    <groupId>org.hibernate</groupId>
    <artifactId>hibernate-search-analyzers</artifactId>
    <version>${version.hibernate.search}</version>
</dependency>

In the following example, a CharFilter is defined by its factory. In this example, a mapping char filter is used, which will replace characters in the input based on the rules specified in the mapping file. Finally, a list of filters is defined by their factories. In this example, the StopFilter filter is built reading the dedicated words property file. The filter will ignore case.

Procedure 14.2. @AnalyzerDef and the Solr framework

Configure the CharFilter

Define a CharFilter by factory. In this example, a mapping CharFilter is used, which will replace characters in the input based on the rules specified in the mapping file.

@AnalyzerDef(name = "customanalyzer",
    charFilters = {
        @CharFilterDef(factory = MappingCharFilterFactory.class, params = {
            @Parameter(name = "mapping",
                value = 
                    "org/hibernate/search/test/analyzer/solr/mapping-chars.properties")
        })
    },

Define the Tokenizer

A Tokenizer is then defined using the StandardTokenizerFactory.class.

@AnalyzerDef(name = "customanalyzer",
    charFilters = {
        @CharFilterDef(factory = MappingCharFilterFactory.class, params = {
            @Parameter(name = "mapping",
                value = 
                    "org/hibernate/search/test/analyzer/solr/mapping-chars.properties")
        })
    },
  
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class)

List of Filters

Define a list of filters by their factories. In this example, the StopFilter filter is built reading the dedicated words property file. The filter will ignore case.

@AnalyzerDef(name = "customanalyzer",
    charFilters = {
        @CharFilterDef(factory = MappingCharFilterFactory.class, params = {
            @Parameter(name = "mapping",
                value =
                    "org/hibernate/search/test/analyzer/solr/mapping-chars.properties")
        })
    },
  
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
                 
        @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = StopFilterFactory.class, params = {
            @Parameter(name = "words",
                value= "org/hibernate/search/test/analyzer/solr/stoplist.properties" ),
            @Parameter(name = "ignoreCase", value = "true")
        })
    })
public class Team {
}

Note

Filters and CharFilters are applied in the order they are defined in the @AnalyzerDef annotation.
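
Why the order matters can be shown with a small plain-Java sketch. It does not use the Solr filter factories; FilterOrderSketch and its two filters are hypothetical stand-ins, and the stop filter here is deliberately case-sensitive (the equivalent of ignoreCase set to false).

```java
import java.util.List;
import java.util.Locale;
import java.util.Set;
import java.util.function.UnaryOperator;
import java.util.stream.Collectors;

// Plain-Java illustration of filter ordering; NOT the Solr filter API.
class FilterOrderSketch {

    static final Set<String> STOP_WORDS = Set.of("the");

    // Lowercases every token.
    static final UnaryOperator<List<String>> LOWER_CASE = tokens ->
        tokens.stream()
              .map(t -> t.toLowerCase(Locale.ROOT))
              .collect(Collectors.toList());

    // Case-sensitive stop filter: removes only exact matches.
    static final UnaryOperator<List<String>> STOP = tokens ->
        tokens.stream()
              .filter(t -> !STOP_WORDS.contains(t))
              .collect(Collectors.toList());

    // Applies the filters strictly in the order given.
    static List<String> apply(List<String> tokens, List<UnaryOperator<List<String>>> filters) {
        List<String> current = tokens;
        for (UnaryOperator<List<String>> filter : filters) {
            current = filter.apply(current);
        }
        return current;
    }

    public static void main(String[] args) {
        List<String> tokens = List.of("The", "Sky");
        System.out.println(apply(tokens, List.of(LOWER_CASE, STOP))); // [sky]
        System.out.println(apply(tokens, List.of(STOP, LOWER_CASE))); // [the, sky]
    }
}
```

With the lowercase filter first, "The" becomes "the" and is removed. With the stop filter first, "The" does not match the stop list and survives into the index.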

14.5.5. Loading Analyzer Resources

Tokenizers, TokenFilters, and CharFilters can load resources such as configuration or metadata files. The StopFilterFactory and the synonym filter are examples. If a resource file does not use the virtual machine's default charset, the charset can be specified explicitly by adding a resource_charset parameter.

Example 14.11. Use a specific charset to load the property file

@AnalyzerDef(name = "customanalyzer",
    charFilters = {
        @CharFilterDef(factory = MappingCharFilterFactory.class, params = {
            @Parameter(name = "mapping",
                value = 
                    "org/hibernate/search/test/analyzer/solr/mapping-chars.properties")
        })
    },
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = StopFilterFactory.class, params = {
            @Parameter(name="words",
                value= "org/hibernate/search/test/analyzer/solr/stoplist.properties"),
            @Parameter(name = "resource_charset", value = "UTF-16BE"),
            @Parameter(name = "ignoreCase", value = "true")
        })
    })
public class Team {
}
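
The effect of the charset can be seen in a plain-Java sketch that reads a stop word list from raw bytes. It does not use the Solr resource loader; StopListCharsetSketch and readStopWords are hypothetical names, and the in-memory byte array stands in for the stop list file.

```java
import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.util.LinkedHashSet;
import java.util.Set;

// Plain-Java illustration of charset-aware resource loading; NOT the Solr loader.
class StopListCharsetSketch {

    // Decodes the resource with the given charset and returns one word per line.
    static Set<String> readStopWords(byte[] resource, Charset charset) {
        Set<String> words = new LinkedHashSet<>();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new ByteArrayInputStream(resource), charset))) {
            String line;
            while ((line = reader.readLine()) != null) {
                String word = line.trim();
                if (!word.isEmpty()) {
                    words.add(word);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return words;
    }

    public static void main(String[] args) {
        // Stand-in for a stop list file saved as UTF-16BE.
        byte[] resource = "the\nand\n".getBytes(StandardCharsets.UTF_16BE);

        // Decoding with the matching charset recovers the words.
        System.out.println(readStopWords(resource, StandardCharsets.UTF_16BE)); // [the, and]

        // Decoding with a different charset yields words that no longer match.
        System.out.println(readStopWords(resource, StandardCharsets.UTF_8).contains("the")); // false
    }
}
```

A stop list decoded with the wrong charset fails silently: the filter loads, but its entries never match any token.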

14.5.6. Dynamic Analyzer Selection

The Query Module uses the @AnalyzerDiscriminator annotation to enable dynamic analyzer selection.

An analyzer can be selected based on the current state of an entity that is to be indexed. This is particularly useful in multilingual applications. For example, when using the BlogEntry class, the analyzer can depend on the language property of the entry. Depending on this property, the correct language-specific stemmer can then be chosen to index the text.

An implementation of the Discriminator interface must return the name of an existing Analyzer definition, or null if the default analyzer is not overridden.

The following example assumes that the language property is either 'de' or 'en', matching the analyzer names specified in the @AnalyzerDefs.

Procedure 14.3. Configure the @AnalyzerDiscriminator

Predefine Dynamic Analyzers

The @AnalyzerDiscriminator requires that all analyzers that are to be used dynamically are predefined via @AnalyzerDef. The @AnalyzerDiscriminator annotation can then be placed either on the class, or on a specific property of the entity, in order to dynamically select an analyzer. An implementation of the Discriminator interface can be specified using the @AnalyzerDiscriminator impl parameter.

@Indexed
@AnalyzerDefs({
    @AnalyzerDef(name = "en",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = EnglishPorterFilterFactory.class)
        }),
    @AnalyzerDef(name = "de",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = GermanStemFilterFactory.class)
        })
    })
public class BlogEntry {

    @Field
    @AnalyzerDiscriminator(impl = LanguageDiscriminator.class)
    private String language;
  
    @Field
    private String text;
  
    private Set<BlogEntry> references;
  
    // standard getter/setter    
}

Implement the Discriminator Interface

Implement the getAnalyzerDefinitionName() method, which is called for each field added to the Lucene document. The entity being indexed is also passed to the interface method.

The value parameter is set if the @AnalyzerDiscriminator is placed on the property level instead of the class level. In this example, the value represents the current value of this property.

```
public class LanguageDiscriminator implements Discriminator {
    public String getAnalyzerDefinitionName(Object value, Object entity, String field) {
        if (value == null || !(entity instanceof BlogEntry)) {
            return null;
        }
        return (String) value;
    }
}
```
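
The contract described above (return a defined name, or null to keep the default) can be sketched in plain Java. This is not the Query Module's implementation: NameDiscriminator is a hypothetical stand-in for the Discriminator interface, the name "default" is invented for illustration, and rejecting an undefined name with an exception is an assumption about how a caller might enforce the contract.

```java
import java.util.Set;

// Plain-Java illustration of the discriminator contract; NOT the Query Module API.
class DiscriminatorSketch {

    // Hypothetical stand-in mirroring the method shape shown in the example above.
    interface NameDiscriminator {
        String getAnalyzerDefinitionName(Object value, Object entity, String field);
    }

    // Stand-in for the names registered through @AnalyzerDef.
    static final Set<String> DEFINED_NAMES = Set.of("en", "de");

    // Invented label for "the analyzer that would have been used anyway".
    static final String DEFAULT_NAME = "default";

    // Resolves the analyzer name for one field of one entity.
    static String resolve(NameDiscriminator discriminator, Object value, Object entity, String field) {
        String name = discriminator.getAnalyzerDefinitionName(value, entity, field);
        if (name == null) {
            return DEFAULT_NAME; // null means: do not override the default analyzer
        }
        if (!DEFINED_NAMES.contains(name)) {
            // Assumption: an undefined name is treated as a configuration error.
            throw new IllegalArgumentException("No analyzer definition named " + name);
        }
        return name;
    }

    public static void main(String[] args) {
        NameDiscriminator byLanguage = (value, entity, field) ->
            value == null ? null : value.toString();

        System.out.println(resolve(byLanguage, "de", new Object(), "text")); // de
        System.out.println(resolve(byLanguage, null, new Object(), "text")); // default
    }
}
```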

14.5.7. Retrieving an Analyzer

Retrieving an analyzer is useful when multiple analyzers have been used in a domain model, for example to benefit from stemming or phonetic approximation. In this case, use the same analyzers to build a query. Alternatively, use the Lucene-based Query API, which selects the correct analyzer automatically. See Section 15.1.2, “Building a Lucene Query”.

Whether the Lucene programmatic API or the Lucene query parser is used, the scoped analyzer for a given entity can be retrieved from the search factory. A scoped analyzer applies the right analyzer depending on the field indexed. Multiple analyzers can be defined on a given entity, each working on an individual field. A scoped analyzer unifies these analyzers into a context-aware analyzer.

In the following example, the song title is indexed in two fields:

Standard analyzer: used in the title field.
Stemming analyzer: used in the title_stemmed field.

Using the analyzer provided by the search factory, the query uses the appropriate analyzer depending on the field targeted.

Example 14.12. Using the scoped analyzer when building a full-text query

SearchManager manager = Search.getSearchManager(cache);

org.apache.lucene.queryParser.QueryParser parser = new QueryParser(
    org.apache.lucene.util.Version.LUCENE_36,
    "title", 
    manager.getSearchFactory().getAnalyzer(Song.class)
);

org.apache.lucene.search.Query luceneQuery = 
    parser.parse("title:sky OR title_stemmed:diamond");

// wrap Lucene query in a org.infinispan.query.CacheQuery
CacheQuery cacheQuery = manager.getQuery(luceneQuery, Song.class);

List result = cacheQuery.list(); 
//return the list of matching objects
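
The idea behind a scoped analyzer, one analyzer per field behind a single entry point, can be sketched in plain Java. This is not the Lucene implementation: ScopedAnalyzerSketch is a hypothetical class, and the "stemmer" is a toy that only strips a trailing "s".

```java
import java.util.Arrays;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Plain-Java illustration of per-field analyzer dispatch; NOT the Lucene API.
class ScopedAnalyzerSketch {

    // Toy "standard" analyzer: lowercase and split on whitespace.
    static final Function<String, List<String>> STANDARD = text ->
        Arrays.stream(text.toLowerCase(Locale.ROOT).split("\\s+"))
              .filter(t -> !t.isEmpty())
              .collect(Collectors.toList());

    // Toy "stemming" analyzer: standard analysis, then strip one trailing "s".
    static final Function<String, List<String>> STEMMING = text ->
        STANDARD.apply(text).stream()
                .map(t -> t.length() > 1 && t.endsWith("s") ? t.substring(0, t.length() - 1) : t)
                .collect(Collectors.toList());

    private final Map<String, Function<String, List<String>>> perField;
    private final Function<String, List<String>> fallback;

    ScopedAnalyzerSketch(Map<String, Function<String, List<String>>> perField,
                         Function<String, List<String>> fallback) {
        this.perField = perField;
        this.fallback = fallback;
    }

    // Picks the analyzer registered for the field, or the fallback if none is registered.
    List<String> analyze(String field, String text) {
        return perField.getOrDefault(field, fallback).apply(text);
    }

    public static void main(String[] args) {
        ScopedAnalyzerSketch scoped = new ScopedAnalyzerSketch(
            Map.of("title", STANDARD, "title_stemmed", STEMMING), STANDARD);

        System.out.println(scoped.analyze("title", "Diamonds"));         // [diamonds]
        System.out.println(scoped.analyze("title_stemmed", "Diamonds")); // [diamond]
    }
}
```

The same query text yields different tokens depending on the target field, which is why the query must be analyzed with the entity's scoped analyzer rather than a single fixed one.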

Note

Analyzers defined via @AnalyzerDef can also be retrieved by their definition name using searchFactory.getAnalyzer(String).

14.5.8. Available Analyzers

Apache Solr and Lucene ship with a number of default CharFilters, tokenizers, and filters. A complete list of CharFilter, tokenizer, and filter factories is available at http://wiki.apache.org/solr/AnalyzersTokenizersTokenFilters. The following tables provide some example CharFilters, tokenizers, and filters.

Table 14.1. Example of available CharFilters

Factory	Description	Parameters	Additional dependencies
`MappingCharFilterFactory`	Replaces one or more characters with one or more characters, based on mappings specified in the resource file	`mapping`: points to a resource file containing the mappings using the format: "á" => "a" "ñ" => "n" "ø" => "o"	none
`HTMLStripCharFilterFactory`	Remove HTML standard tags, keeping the text	none	none

Table 14.2. Example of available tokenizers

Factory	Description	Parameters	Additional dependencies
`StandardTokenizerFactory`	Use the Lucene StandardTokenizer	none	none
`HTMLStripStandardTokenizerFactory`	Remove HTML tags, keep the text and pass it to a StandardTokenizer.	none	`solr-core`
`PatternTokenizerFactory`	Breaks text at the specified regular expression pattern.	`pattern`: the regular expression to use for tokenizing; `group`: says which pattern group to extract into tokens	`solr-core`

Table 14.3. Examples of available filters

Factory	Description	Parameters	Additional dependencies
`StandardFilterFactory`	Remove dots from acronyms and 's from words	none	`solr-core`
`LowerCaseFilterFactory`	Lowercases all words	none	`solr-core`
`StopFilterFactory`	Remove words (tokens) matching a list of stop words	`words`: points to a resource file containing the stop words; `ignoreCase`: `true` if case should be ignored when comparing stop words, `false` otherwise	`solr-core`
`SnowballPorterFilterFactory`	Reduces a word to its root in a given language (for example, protect, protects, and protection share the same root). Using such a filter allows searches to match related words.	`language`: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish and a few more	`solr-core`
`ISOLatin1AccentFilterFactory`	Remove accents for languages like French	none	`solr-core`
`PhoneticFilterFactory`	Inserts phonetically similar tokens into the token stream	`encoder`: one of DoubleMetaphone, Metaphone, Soundex or RefinedSoundex; `inject`: `true` will add tokens to the stream, `false` will replace the existing token; `maxCodeLength`: sets the maximum length of the code to be generated, supported only for Metaphone and DoubleMetaphone encodings	`solr-core` and `commons-codec`
`CollationKeyFilterFactory`	Converts each token into its `java.text.CollationKey`, and then encodes the `CollationKey` with `IndexableBinaryStringTools`, to allow it to be stored as an index term.	`custom`, `language`, `country`, `variant`, `strength`, `decomposition`; see Lucene's `CollationKeyFilter` javadocs for more info	`solr-core` and `commons-io`

To see all available implementations, browse the implementations of org.apache.solr.analysis.TokenizerFactory and org.apache.solr.analysis.TokenFilterFactory in your IDE.
