Chapter 4. Mapping entities to the index structure
4.1. Mapping an entity
4.1.1. Basic mapping
To make a persistent class indexable, annotate it with @Indexed (all entities not annotated with @Indexed will be ignored by the indexing process):
Example 4.1. Making a class indexable using the @Indexed annotation
@Entity
@Indexed(index="indexes/essays")
public class Essay {
...
}
The index attribute tells Hibernate Search what the Lucene directory name is (usually a directory on your file system). It is recommended to define a base directory for all Lucene indexes using the hibernate.search.default.indexBase property in your configuration file. Alternatively, you can specify a base directory per indexed entity by specifying hibernate.search.<index>.indexBase, where <index> is the fully qualified class name of the indexed entity. Each entity instance will be represented by a Lucene Document inside the given index (aka Directory).
@Field declares a property as indexed. When indexing an element to a Lucene document you can specify how it is indexed:
- name: describes under which name the property should be stored in the Lucene Document. The default value is the property name (following the JavaBeans convention).
- store: describes whether or not the property is stored in the Lucene index. You can store the value with Store.YES (consuming more space in the index but allowing projection, see Section 5.1.2.5, “Projection” for more information), store it in a compressed way with Store.COMPRESS (this consumes more CPU), or avoid any storage with Store.NO (this is the default value). When a property is stored, you can retrieve its original value from the Lucene Document. This is not related to whether the element is indexed or not.
- index: describes how the element is indexed and the type of information stored. The different values are Index.NO (no indexing, i.e. the element cannot be found by a query), Index.TOKENIZED (use an analyzer to process the property), Index.UN_TOKENIZED (no analyzer pre-processing), and Index.NO_NORMS (do not store the normalization data). The default value is TOKENIZED.
- termVector: describes collections of term-frequency pairs. This attribute enables term vectors to be stored during indexing so they are available within documents. The default value is TermVector.NO. The different values of this attribute are:
Value | Definition |
---|---|
TermVector.YES | Store the term vectors of each document. This produces two synchronized arrays, one contains document terms and the other contains the term's frequency. |
TermVector.NO | Do not store term vectors. |
TermVector.WITH_OFFSETS | Store the term vector and token offset information. This is the same as TermVector.YES plus it contains the starting and ending offset position information for the terms. |
TermVector.WITH_POSITIONS | Store the term vector and token position information. This is the same as TermVector.YES plus it contains the ordinal positions of each occurrence of a term in a document. |
TermVector.WITH_POSITIONS_OFFSETS | Store the term vector, token position and offset information. This is a combination of YES, WITH_OFFSETS and WITH_POSITIONS. |
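The "two synchronized arrays" of a term vector can be pictured with a small plain-Java sketch. It does not use Lucene or Hibernate Search; the class name TermVectorSketch, its methods and the naive tokenization are assumptions made only for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.TreeMap;

// Plain-Java illustration only; this is not how Lucene stores term vectors internally.
class TermVectorSketch {

    /** Distinct terms of the text in sorted order; entry i pairs with frequencies(text)[i]. */
    static String[] terms(String text) {
        return count(text).keySet().toArray(new String[0]);
    }

    /** Number of occurrences of each term, in the same order as terms(text). */
    static int[] frequencies(String text) {
        return count(text).values().stream().mapToInt(Integer::intValue).toArray();
    }

    // Naive tokenization (an assumption): lowercase, then split on non-alphanumeric characters.
    private static Map<String, Integer> count(String text) {
        Map<String, Integer> counts = new TreeMap<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                counts.merge(token, 1, Integer::sum);
            }
        }
        return counts;
    }

    public static void main(String[] args) {
        String text = "the sky above the diamond sky";
        String[] terms = terms(text);
        int[] freqs = frequencies(text);
        for (int i = 0; i < terms.length; i++) {
            System.out.println(terms[i] + " -> " + freqs[i]);
        }
    }
}
```

The WITH_POSITIONS and WITH_OFFSETS variants would add, for each term, the token positions or character offsets of its occurrences; the sketch leaves those out.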
Note
To mark a property as the document id, use the @DocumentId annotation. If you are using Hibernate Annotations and you have specified @Id, you can omit @DocumentId. The chosen entity id will also be used as the document id.
Example 4.2. Adding @DocumentId and @Field annotations to an indexed entity
@Entity
@Indexed(index="indexes/essays")
public class Essay {
    ...

    @Id
    @DocumentId
    public Long getId() { return id; }

    @Field(name="Abstract", index=Index.TOKENIZED, store=Store.YES)
    public String getSummary() { return summary; }

    @Lob
    @Field(index=Index.TOKENIZED)
    public String getText() { return text; }
}
Example 4.2, “Adding @DocumentId and @Field annotations to an indexed entity” defines an index with three fields: id, Abstract and text. Note that by default the field name is decapitalized, following the JavaBeans specification.
4.1.2. Mapping properties multiple times
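Sometimes a property has to be mapped multiple times per index, with slightly different indexing strategies. For example, sorting a query by a field requires the field to be UN_TOKENIZED. If you want to search by words in this property and still sort on it, you need to index it twice, once tokenized and once untokenized. @Fields allows you to achieve this goal.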
Example 4.3. Using @Fields to map a property multiple times
@Entity
@Indexed(index = "Book" )
public class Book {
    @Fields( {
        @Field(index = Index.TOKENIZED),
        @Field(name = "summary_forSort", index = Index.UN_TOKENIZED, store = Store.YES)
    } )
    public String getSummary() {
        return summary;
    }

    ...
}
In this example the property summary is indexed twice: once as summary in a tokenized way, and once as summary_forSort in an untokenized way. @Field supports two attributes that are useful when @Fields is used:
- analyzer: defines an @Analyzer annotation per field rather than per property
- bridge: defines a @FieldBridge annotation per field rather than per property
4.1.3. Embedded and associated objects
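Associated objects as well as embedded objects can be indexed as part of the root entity index. This is useful if you expect to search a given entity based on properties of its associated objects. In the following example the aim is to return places where the associated city is Atlanta (in the Lucene query parser language, this translates into address.city:Atlanta).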
Example 4.4. Using @IndexedEmbedded to index associations
@Entity
@Indexed
public class Place {
    @Id
    @GeneratedValue
    @DocumentId
    private Long id;

    @Field( index = Index.TOKENIZED )
    private String name;

    @OneToOne( cascade = { CascadeType.PERSIST, CascadeType.REMOVE } )
    @IndexedEmbedded
    private Address address;
    ....
}

@Entity
public class Address {
    @Id
    @GeneratedValue
    private Long id;

    @Field(index=Index.TOKENIZED)
    private String street;

    @Field(index=Index.TOKENIZED)
    private String city;

    @ContainedIn
    @OneToMany(mappedBy="address")
    private Set<Place> places;
    ...
}
In this example, the Place fields are indexed in the Place index. The Place index documents will also contain the fields address.id, address.street, and address.city, which you will be able to query. This is enabled by the @IndexedEmbedded annotation.
Because the data is denormalized in the Lucene index when using the @IndexedEmbedded technique, Hibernate Search needs to be aware of any change in the Place object and any change in the Address object to keep the index up to date. To make sure the Place Lucene document is updated when its Address changes, you need to mark the other side of the bidirectional relationship with @ContainedIn.
@ContainedIn is only useful on associations pointing to entities as opposed to embedded (collection of) objects.
Example 4.5. Nested usage of @IndexedEmbedded and @ContainedIn
@Entity
@Indexed
public class Place {
    @Id
    @GeneratedValue
    @DocumentId
    private Long id;

    @Field( index = Index.TOKENIZED )
    private String name;

    @OneToOne( cascade = { CascadeType.PERSIST, CascadeType.REMOVE } )
    @IndexedEmbedded
    private Address address;
    ....
}

@Entity
public class Address {
    @Id
    @GeneratedValue
    private Long id;

    @Field(index=Index.TOKENIZED)
    private String street;

    @Field(index=Index.TOKENIZED)
    private String city;

    @IndexedEmbedded(depth = 1, prefix = "ownedBy_")
    private Owner ownedBy;

    @ContainedIn
    @OneToMany(mappedBy="address")
    private Set<Place> places;
    ...
}

@Embeddable
public class Owner {
    @Field(index = Index.TOKENIZED)
    private String name;
    ...
}
Any @*ToMany, @*ToOne or @Embedded attribute can be annotated with @IndexedEmbedded. The attributes of the associated class will then be added to the main entity index. In the previous example, the index will contain the following fields:
- id
- name
- address.street
- address.city
- address.ownedBy_name
The default prefix is propertyName., following the traditional object navigation convention. You can override it using the prefix attribute, as shown on the ownedBy property.
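The way the field names in the list above are composed can be sketched in plain Java. This is only an illustration of the naming convention described here; prefixFor is a hypothetical helper, not a Hibernate Search API.

```java
// Plain-Java illustration of the naming convention; not Hibernate Search code.
class EmbeddedFieldNameSketch {

    /**
     * Returns the prefix contributed by an @IndexedEmbedded property:
     * the explicit prefix attribute if one is given, otherwise "propertyName.".
     */
    static String prefixFor(String propertyName, String explicitPrefix) {
        return explicitPrefix != null ? explicitPrefix : propertyName + ".";
    }

    public static void main(String[] args) {
        // Place.address uses the default prefix.
        String address = prefixFor("address", null);
        // Address.ownedBy overrides the prefix with "ownedBy_".
        String owner = address + prefixFor("ownedBy", "ownedBy_");

        System.out.println(address + "street");
        System.out.println(address + "city");
        System.out.println(owner + "name");
    }
}
```

Prefixes accumulate along the path from the root entity, which is why the owner's name ends up in the field address.ownedBy_name rather than ownedBy_name.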
Note
The depth property is necessary when the object graph contains a cyclic dependency of classes (not instances), for example if Owner points to Place. Hibernate Search will stop including indexed embedded attributes after reaching the expected depth (or when the object graph boundaries are reached). A class having a self reference is an example of cyclic dependency. In our example, because depth is set to 1, any @IndexedEmbedded attribute in Owner (if any) will be ignored.
Using @IndexedEmbedded for object associations allows you to express queries such as:
- Return places where name contains JBoss and where address city is Atlanta. In Lucene query this would be
+name:jboss +address.city:atlanta
- Return places where name contains JBoss and where owner's name contains Joe. In Lucene query this would be
+name:jboss +address.ownedBy_name:joe
Note
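The associated object can itself be @Indexed, but does not have to be. When @IndexedEmbedded points to an entity, the association has to be bidirectional and the other side has to be annotated with @ContainedIn (as seen in the previous example). If not, Hibernate Search has no way to update the root index when the associated entity is updated (in our example, a Place index document has to be updated when the associated Address instance is updated).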
Sometimes the object type annotated by @IndexedEmbedded is not the object type targeted by Hibernate and Hibernate Search. This is especially the case when interfaces are used in lieu of their implementation. For this reason you can override the object type targeted by Hibernate Search using the targetElement parameter.
Example 4.6. Using the targetElement property of @IndexedEmbedded
@Entity
@Indexed
public class Address {
    @Id
    @GeneratedValue
    @DocumentId
    private Long id;

    @Field(index= Index.TOKENIZED)
    private String street;

    @IndexedEmbedded(depth = 1, prefix = "ownedBy_", targetElement = Owner.class)
    @Target(Owner.class)
    private Person ownedBy;
    ...
}

@Embeddable
public class Owner implements Person { ... }
4.1.4. Boost factor
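Lucene has the notion of a boost factor: a way to give more weight to a field or to an indexed element than to others during indexing. You can use @Boost at the @Field, method or class level.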
Example 4.7. Using different ways of increasing the weight of an indexed element using a boost factor
@Entity
@Indexed(index="indexes/essays")
@Boost(1.7f)
public class Essay {
    ...

    @Id
    @DocumentId
    public Long getId() { return id; }

    @Field(name="Abstract", index=Index.TOKENIZED, store=Store.YES, boost=@Boost(2f))
    @Boost(1.5f)
    public String getSummary() { return summary; }

    @Lob
    @Field(index=Index.TOKENIZED, boost=@Boost(1.2f))
    public String getText() { return text; }

    @Field
    public String getISBN() { return isbn; }
}
In this example, Essay's probability to reach the top of the search list will be multiplied by 1.7. The summary field will be 3.0 times (2 * 1.5, since @Field.boost and @Boost on a property are cumulative) more important than the isbn field. The text field will be 1.2 times more important than the isbn field. Note that this explanation is, in the strictest terms, wrong, but it is simple and close enough to reality for all practical purposes. For details, check the Lucene documentation or the excellent Lucene In Action by Otis Gospodnetic and Erik Hatcher.
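The arithmetic behind these relative weights can be written out as a short plain-Java sketch. It follows the simplified "boosts multiply" model used in this section, not Lucene's actual scoring; the class BoostSketch and its cumulative method are hypothetical names used only for illustration.

```java
// Simplified model only: boosts declared on the same element are multiplied together.
class BoostSketch {

    /** Multiplies all boosts that apply to one element; no boost at all means 1.0. */
    static float cumulative(float... boosts) {
        float result = 1.0f;
        for (float boost : boosts) {
            result *= boost;
        }
        return result;
    }

    public static void main(String[] args) {
        float summary = cumulative(2f, 1.5f); // @Field.boost and @Boost on getSummary()
        float text = cumulative(1.2f);        // @Field.boost on getText()
        float isbn = cumulative();            // no boost on getISBN()

        System.out.println("summary relative to isbn: " + summary / isbn);
        System.out.println("text relative to isbn: " + text / isbn);
    }
}
```

The class-level @Boost(1.7f) is left out of the comparison because, in this model, it applies equally to every field of Essay and therefore cancels out when fields of the same entity are compared.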
4.1.5. Dynamic boost factor
The @Boost annotation used in Section 4.1.4, “Boost factor” defines a static boost factor which is independent of the state of the indexed entity at runtime. However, there are use cases in which the boost factor may depend on the actual state of the entity. In this case you can use the @DynamicBoost annotation together with an accompanying custom BoostStrategy.
Example 4.8. Dynamic boost example
public enum PersonType {
    NORMAL,
    VIP
}

@Entity
@Indexed
@DynamicBoost(impl = VIPBoostStrategy.class)
public class Person {
    private PersonType type;

    // ....
}

public class VIPBoostStrategy implements BoostStrategy {
    public float defineBoost(Object value) {
        Person person = ( Person ) value;
        if ( person.getType().equals( PersonType.VIP ) ) {
            return 2.0f;
        }
        else {
            return 1.0f;
        }
    }
}
In this example, a dynamic boost is defined at class level, specifying VIPBoostStrategy as the implementation of the BoostStrategy interface to be used at indexing time. You can place @DynamicBoost either at class or at field level. Depending on the placement of the annotation, either the whole entity or just the annotated field/property value is passed to the defineBoost method. It's up to you to cast the passed object to the correct type. In the example, all indexed values of a VIP person would be twice as important as the values of a normal person.
Note
The specified BoostStrategy implementation must define a public no-arg constructor.
You can mix and match @Boost and @DynamicBoost annotations in your entity. All defined boost factors are cumulative as described in Section 4.1.4, “Boost factor”.
4.1.6. Analyzer
The default analyzer class used to index tokenized fields is configurable through the hibernate.search.analyzer property. The default value for this property is org.apache.lucene.analysis.standard.StandardAnalyzer.
Example 4.9. Different ways of specifying an analyzer
@Entity
@Indexed
@Analyzer(impl = EntityAnalyzer.class)
public class MyEntity {
    @Id
    @GeneratedValue
    @DocumentId
    private Integer id;

    @Field(index = Index.TOKENIZED)
    private String name;

    @Field(index = Index.TOKENIZED)
    @Analyzer(impl = PropertyAnalyzer.class)
    private String summary;

    @Field(index = Index.TOKENIZED, analyzer = @Analyzer(impl = FieldAnalyzer.class))
    private String body;

    ...
}
In this example, EntityAnalyzer is used to index all tokenized properties (e.g. name), except summary and body, which are indexed with PropertyAnalyzer and FieldAnalyzer respectively.
4.1.6.1. Analyzer definitions
An analyzer definition can be reused by many @Analyzer declarations. An analyzer definition is composed of:
- a name: the unique string used to refer to the definition
- a tokenizer: responsible for tokenizing the input stream into individual words
- a list of filters: each filter is responsible for removing, modifying or sometimes even adding words to the stream provided by the tokenizer
The Tokenizer starts the analysis process by turning the character input into tokens, which are then further processed by the TokenFilters. Hibernate Search supports this infrastructure by utilizing the Solr analyzer framework. Make sure to add solr-core.jar and solr-common.jar to your classpath to use analyzer definitions. If you also want to use a snowball stemmer, also include lucene-snowball.jar. Other Solr analyzers might depend on more libraries. For example, the PhoneticFilterFactory depends on commons-codec. Your distribution of Hibernate Search provides these dependencies in its lib directory.
Example 4.10. @AnalyzerDef and the Solr framework
@AnalyzerDef(name="customanalyzer",
    tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
    filters = {
        @TokenFilterDef(factory = ISOLatin1AccentFilterFactory.class),
        @TokenFilterDef(factory = LowerCaseFilterFactory.class),
        @TokenFilterDef(factory = StopFilterFactory.class, params = {
            @Parameter(name="words", value= "org/hibernate/search/test/analyzer/solr/stoplist.properties" ),
            @Parameter(name="ignoreCase", value="true")
        })
    })
public class Team {
    ...
}
Warning
Filters are applied in the order they are defined in the @AnalyzerDef annotation. Make sure to think twice about this order.
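Why the filter order deserves a second thought can be shown with a plain-Java sketch. It does not use the Solr or Lucene classes; the class FilterOrderSketch, its methods and the tiny stop list are assumptions, and the stop filter here is deliberately case-sensitive (as if ignoreCase were false).

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Plain-Java illustration of a tokenizer followed by filters; not the Solr framework.
class FilterOrderSketch {

    // Hypothetical stop list, stored in lowercase.
    static final Set<String> STOP_WORDS = new HashSet<>(Arrays.asList("the", "a"));

    /** Tokenizer: splits the input on whitespace. */
    static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                tokens.add(token);
            }
        }
        return tokens;
    }

    /** Filter: lowercases every token. */
    static List<String> lowerCase(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            result.add(token.toLowerCase(Locale.ROOT));
        }
        return result;
    }

    /** Filter: drops tokens that exactly match a stop word (case-sensitive). */
    static List<String> removeStopWords(List<String> tokens) {
        List<String> result = new ArrayList<>();
        for (String token : tokens) {
            if (!STOP_WORDS.contains(token)) {
                result.add(token);
            }
        }
        return result;
    }

    static List<String> lowerCaseThenStop(String text) {
        return removeStopWords(lowerCase(tokenize(text)));
    }

    static List<String> stopThenLowerCase(String text) {
        return lowerCase(removeStopWords(tokenize(text)));
    }

    public static void main(String[] args) {
        System.out.println(lowerCaseThenStop("The Sky")); // "The" is lowercased first, then removed
        System.out.println(stopThenLowerCase("The Sky")); // "The" slips past the stop filter
    }
}
```

With the same two filters, only the first ordering removes the capitalized stop word; the second leaves it in the token stream.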
Once defined, an analyzer definition can be reused by an @Analyzer declaration using the definition name rather than declaring an implementation class.
Example 4.11. Referencing an analyzer by name
@Entity
@Indexed
@AnalyzerDef(name="customanalyzer", ... )
public class Team {
    @Id
    @DocumentId
    @GeneratedValue
    private Integer id;

    @Field
    private String name;

    @Field
    private String location;

    @Field @Analyzer(definition = "customanalyzer")
    private String description;
}
Analyzer instances declared by @AnalyzerDef are available by their name in the SearchFactory.
Analyzer analyzer = fullTextSession.getSearchFactory().getAnalyzer("customanalyzer");
4.1.6.2. Available analyzers
Table 4.1. Some of the tokenizers available
Factory | Description | parameters |
---|---|---|
StandardTokenizerFactory | Use the Lucene StandardTokenizer | none |
HTMLStripStandardTokenizerFactory | Remove HTML tags, keep the text and pass it to a StandardTokenizer | none |
Table 4.2. Some of the filters available
Factory | Description | parameters |
---|---|---|
StandardFilterFactory | Remove dots from acronyms and 's from words | none |
LowerCaseFilterFactory | Lowercase words | none |
StopFilterFactory | Remove words (tokens) matching a list of stop words | words: points to a resource file containing the stop words; ignoreCase: true if case should be ignored when comparing stop words, false otherwise |
SnowballPorterFilterFactory | Reduces a word to its root in a given language (e.g. protect, protects, protection share the same root). Using such a filter allows searches matching related words. | language: Danish, Dutch, English, Finnish, French, German, Italian, Norwegian, Portuguese, Russian, Spanish, Swedish |
ISOLatin1AccentFilterFactory | remove accents for languages like French | none |
Check the implementations of org.apache.solr.analysis.TokenizerFactory and org.apache.solr.analysis.TokenFilterFactory in your IDE to see the implementations available.
4.1.6.3. Analyzer discriminator (experimental)
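In some use cases it is useful to select an analyzer depending on the current state of the entity being indexed. For a BlogEntry class, for example, the analyzer could depend on the language property of the entry. Depending on this property, the correct language-specific stemmer should be chosen to index the actual text.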
This dynamic analyzer selection is enabled by the @AnalyzerDiscriminator annotation. The following example demonstrates the usage of this annotation:
Example 4.12. Usage of @AnalyzerDiscriminator in order to select an analyzer depending on the entity state
@Entity
@Indexed
@AnalyzerDefs({
    @AnalyzerDef(name = "en",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = EnglishPorterFilterFactory.class )
        }),
    @AnalyzerDef(name = "de",
        tokenizer = @TokenizerDef(factory = StandardTokenizerFactory.class),
        filters = {
            @TokenFilterDef(factory = LowerCaseFilterFactory.class),
            @TokenFilterDef(factory = GermanStemFilterFactory.class)
        })
})
public class BlogEntry {

    @Id
    @GeneratedValue
    @DocumentId
    private Integer id;

    @Field
    @AnalyzerDiscriminator(impl = LanguageDiscriminator.class)
    private String language;

    @Field
    private String text;

    private Set<BlogEntry> references;

    // standard getter/setter
    ...
}
public class LanguageDiscriminator implements Discriminator {

    public String getAnanyzerDefinitionName(Object value, Object entity, String field) {
        // Only override the analyzer for BlogEntry instances with a language set.
        if ( value == null || !( entity instanceof BlogEntry ) ) {
            return null;
        }
        return (String) value;
    }
}
The prerequisite for using @AnalyzerDiscriminator is that all analyzers which are going to be used are predefined via @AnalyzerDef definitions. If this is the case, you can place the @AnalyzerDiscriminator annotation either on the class or on a specific property of the entity for which to dynamically select an analyzer. Via the impl parameter of the AnalyzerDiscriminator you specify a concrete implementation of the Discriminator interface. It is up to you to provide an implementation for this interface. The only method you have to implement is getAnanyzerDefinitionName(), which gets called for each field added to the Lucene document. The entity which is being indexed is also passed to the interface method. The value parameter is only set if the AnalyzerDiscriminator is placed at property level instead of class level. In this case the value represents the current value of this property.
An implementation of the Discriminator interface has to return the name of an existing analyzer definition if the analyzer should be set dynamically, or null if the default analyzer should not be overridden. The given example assumes that the language parameter is either 'de' or 'en', which matches the specified names in the @AnalyzerDefs.
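The contract described above (return a definition name, or null to keep the default) can be sketched in plain Java. This is an illustration under stated assumptions, not the Hibernate Search implementation: the class DiscriminatorSketch, the "default" placeholder and in particular the exception thrown for an unknown name are hypothetical.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

// Plain-Java illustration of the selection logic; not Hibernate Search code.
class DiscriminatorSketch {

    // Names assumed to be declared via @AnalyzerDef, as in the example above.
    static final Set<String> DEFINED = new HashSet<>(Arrays.asList("en", "de"));

    // Placeholder standing in for whatever analyzer would be used without a discriminator.
    static final String DEFAULT = "default";

    /** Plays the role of the discriminator: a definition name, or null for "do not override". */
    static String discriminate(Object value) {
        return value instanceof String ? (String) value : null;
    }

    /** Resolves the analyzer name that would be used for the given language value. */
    static String resolve(Object languageValue) {
        String name = discriminate(languageValue);
        if (name == null) {
            return DEFAULT;
        }
        if (!DEFINED.contains(name)) {
            // Assumption: how the real framework reacts to an unknown name is not covered here.
            throw new IllegalArgumentException("No analyzer definition named " + name);
        }
        return name;
    }

    public static void main(String[] args) {
        System.out.println(resolve("de"));
        System.out.println(resolve(null));
    }
}
```

The sketch makes the prerequisite visible: a returned name is only useful if a definition with that name was declared beforehand.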
Note
@AnalyzerDiscriminator is currently still experimental and the API might still change. We are hoping for some feedback from the community about the usefulness and usability of this feature.
4.1.6.4. Retrieving an analyzer
Example 4.13. Using the scoped analyzer when building a full-text query
org.apache.lucene.queryParser.QueryParser parser = new QueryParser(
    "title",
    fullTextSession.getSearchFactory().getAnalyzer( Song.class )
);

// Boolean operators must be upper case in the Lucene query syntax.
org.apache.lucene.search.Query luceneQuery =
    parser.parse( "title:sky OR title_stemmed:diamond" );

org.hibernate.Query fullTextQuery =
    fullTextSession.createFullTextQuery( luceneQuery, Song.class );

List result = fullTextQuery.list(); //return a list of managed objects
In this example, the standard analyzer is used in the field title and a stemming analyzer is used in the field title_stemmed. By using the analyzer provided by the search factory, the query uses the appropriate analyzer depending on the field targeted.
Analyzers defined via @AnalyzerDef can also be retrieved by their definition name using searchFactory.getAnalyzer(String).