Chapter 7. Managing Indexes

Indexing makes searching for and retrieving information easier by classifying and organizing attributes or values. This chapter describes the searching algorithm itself, placing indexing mechanisms in context, and then describes how to create, delete, and manage indexes.

7.1. About Indexes

This section provides an overview of indexing in Directory Server. It contains the following topics:

7.1.1. About Index Types

Indexes are stored in files in the directory's databases. The names of the files are based on the indexed attribute, not the type of index contained in the file. Each index file may contain multiple types of indexes if multiple indexes are maintained for the specific attribute. For example, all indexes maintained for the common name attribute are contained in the cn.db4 file.
Directory Server supports the following types of index:
  • Presence index (pres) contains a list of the entries that contain a particular attribute, which is very useful for searched. For example, it makes it easy to examine any entries that contain access control information. Generating an aci.db4 file that includes a presence index efficiently performs the search for ACI=* to generate the access control list for the server.
    The presence index is not used for base object searches.
  • Equality index (eq) improves searches for entries containing a specific attribute value. For example, an equality index on the cn attribute allows a user to perform the search for cn=Babs Jensen far more efficiently.
  • Approximate index (approx) is used for efficient approximate or sounds-like searches. For example, an entry may include the attribute value cn=Robert E Lee. An approximate search would return this value for searches against cn~=Robert Lee, cn~=Robert, or cn~=Lee. Similarly, a search against l~=San Fransisco (note the misspelling) would return entries including l=San Francisco.
  • Substring index (sub) is a costly index to maintain, but it allows efficient searching against substrings within entries. Substring indexes are limited to a minimum of three characters for each entry.
    For example, searches of the form cn=*derson , match the common names containing strings such as Bill Anderson, Jill Henderson, or Steve Sanderson. Similarly, the search for telephoneNumber= *555* returns all the entries in the directory with telephone numbers that contain 555.
  • International index speeds up searches for information in international directories. The process for creating an international index is similar to the process for creating regular indexes, except that it applies a matching rule by associating an object identifier (OID) with the attributes to be indexed.
    The supported locales and their associated OIDs are listed in Appendix C, Internationalization. If there is a need to configure the Directory Server to accept additional matching rules, contact Red Hat Professional Services.
  • Browsing index, or virtual list view (VLV) index, speeds up the display of entries in the Directory Server Console. This index is particularly useful if a branch of your directory contains hundreds of entries; for example, the ou=people branch. You can create a browsing index on any branch point in the directory tree to improve display performance through the Directory Server Console or by using the vlvindex command-line tool, which is explained in the Directory Server Configuration and Command-Line Tool Reference.

7.1.2. About Default, System, and Standard Indexes

When you install Directory Server, a set of default and system indexes is created per database instance. To maintain these indexes, the directory uses standard indexes.

7.1.2.1. Overview of Default Indexes

The default indexes can be modified depending on the directory indexing needs. Always ensure that no server plug-ins or other servers depend on a default index before removing it.
Table 7.1, “Default Indexes” lists the default indexes installed with the directory.

Table 7.1. Default Indexes

Attribute Eq Pres Sub Purpose
cn Improves the performance of the most common types of user directory searches.
givenname Improves the performance of the most common types of user directory searches.
mail Improves the performance of the most common types of user directory searches.
mailHost Used by a messaging server.
member Improves Directory Server performance. This index is also used by the Referential Integrity Plug-in. See Section 3.5, “Maintaining Referential Integrity” for more information.
owner Improves Directory Server performance. This index is also used by the Referential Integrity Plug-in. See Section 3.5, “Maintaining Referential Integrity” for more information.
see Also Improves Directory Server performance. This index is also used by the Referential Integrity Plug-in. See Section 3.5, “Maintaining Referential Integrity” for more information.
sn Improves the performance of the most common types of user directory searches.
telephoneNumber Improves the performance of the most common types of user directory searches.
uid Improves Directory Server performance.
unique member Improves Directory Server performance. This index is also used by the Referential Integrity Plug-in. See Section 3.5, “Maintaining Referential Integrity” for more information.

7.1.2.2. Overview of System Indexes

System indexes cannot be deleted or modified. They are required by the directory to function properly. Table 7.2, “System Indexes” lists the system indexes included with the directory.

Table 7.2. System Indexes

Attribute Eq Pres Purpose
aci Allows the Directory Server to quickly obtain the access control information maintained in the database.
objectClass Used to help accelerate subtree searches in the directory.
entryDN Speeds up entry retrieval based on DN searches.
parentID Enhances directory performance during one-level searches.
numSubordinates Used by the Directory Server Console to enhance display performance on the Directory tab.
nsUniqueID Used to search for specific entries.

7.1.2.3. Overview of Standard Indexes

Because of the need to maintain default indexes and other internal indexing mechanisms, the Directory Server also maintains certain standard index files. The standard index, id2entry.db4, exists by default in Directory Server; you do not need to generate it.
The id2entry.db4 contains the actual directory database entries. All other database files can be recreated from this one.

7.1.3. Overview of the Searching Algorithm

Indexes are used to speed up searches. To understand how the directory uses indexes, it helps to understand the searching algorithm. Each index contains a list of attributes (such as the cn, common name, attribute) and a pointer to the entries corresponding to each value. Directory Serverprocesses a search request as follows:
  1. An LDAP client application sends a search request to the directory.
  2. The directory examines the incoming request to make sure that the specified base DN matches a suffix contained by one or more of its databases or database links.
    • If they do match, the directory processes the request.
    • If they do not match, the directory returns an error to the client indicating that the suffix does not match. If a referral has been specified in the nsslapd-referral attribute under cn=config, the directory also returns the LDAP URL where the client can attempt to pursue the request.
    • If the search request for each database attribute can be satisfied by a single index, then the server reads that index to generate a list of potential matches.
    • If there is no index for the attribute, the directory generates a candidate list that includes all entries in the database, which makes the search considerably slower.
    • If a search request contains multiple attributes, the directory consults multiple indexes and then combines the resulting lists of candidate entries.
    • If there is an index for the attribute, the directory takes the candidate matches from the index files in the form of a series of entry ID numbers.
  3. The directory uses the returned entry ID numbers to read the corresponding entries from the id2entry.db4 file. The Directory Server then examines each of the candidate entries to see if any match the search criteria. The directory returns matching entries to the client as each is found.
    The directory continues until either it has examined all candidate entries or it reaches the limit set in one of the following attributes:
    • nsslapd-sizelimit which specifies the maximum number of entries to return from a search operation. If this limit is reached, the directory returns any entries it has located that match the search request, as well as an exceeded size limit error.
    • nsslapd-timelimit which specifies the maximum number of seconds allocated for a search request. If this limit is reached, the directory returns any entries it has located that match the search request, as well as an exceeded time limit error.
    • nsslapd-lookthroughlimit, which specifies the maximum number of entries that the directory will check when examining candidate entries in response to a search request.
    • nsslapd-idlistscanlimit which specifies the maximum number of entries in an ID list before the list is considered to equal the entire database.
    See Directory Server Configuration and Command-Line Tool Reference for further information about these attributes.

7.1.4. Approximate Searches

In addition, the directory uses a variation of the metaphone phonetic algorithm to perform searches on an approximate index. Each value is treated as a sequence of words, and a phonetic code is generated for each word.

NOTE

The metaphone phonetic algorithm in Directory Server supports only US-ASCII letters. Therefore, use approximate indexing only with English values.
Values entered on an approximate search are similarly translated into a sequence of phonetic codes. An entry is considered to match a query if both of the following are true:
  • All of the query string codes match the codes generated in the entry string.
  • All of the query string codes are in the same order as the entry string codes.
Name in the Directory (Phonetic Code) Query String (Phonetic code) Match Comments
Alice B Sarette (ALS B SRT) Alice Sarette (ALS SRT) Matches. Codes are specified in the correct order.
Alice Sarrette (ALS SRT) Matches. Codes are specified in the correct order, despite the misspelling of Sarette.
Surette (SRT) Matches. The generated code exists in the original name, despite the misspelling of Sarette.
Bertha Sarette (BR0 SRT) No match. The code BR0 does not exist in the original name.
Sarette, Alice (SRT ALS) No match. The codes are not specified in the correct order.

7.1.5. Indexing Performance

Each index that the directory uses is composed of a table of index keys and matching entry ID lists. This entry ID list is used by the directory to build a list of candidate entries that may match a client application's search request; Section 7.1, “About Indexes” describes each kind of Directory Server index. The Directory Server secondary index structure greatly improves write and search operations.

7.1.5.1. Indexing Performance

While achieving extremely high read performance, in previous versions of Directory Server, write performance was limited by the number of bytes per second that could be written into the storage manager's transaction log file. Large log files were generated for each LDAP write operation; in fact, log file verbosity could easily be 100 times the corresponding number of bytes changed in the Directory Server. The majority of the contents in the log files are related to index changes (ID insert and delete operations).
The secondary index structure was separated into two levels in the old design:
  • The ID list structures, which were the province of the Directory Server backend and opaque to the storage manager.
  • The storage manager structures (Btrees), which were opaque to the Directory Server backend code.
Because it had no insight into the internal structure of the ID lists, the storage manager had to treat ID lists as opaque byte arrays. From the storage manager's perspective, when the content of an ID list changed, the entire list had changed. For a single ID that was inserted or deleted from an ID list, the corresponding number of bytes written to the transaction log was the maximum configured size for that ID list, about 8 kilobytes. Also, every database page on which the list was stored was marked as dirty, since the entire list had changed.
In the redesigned index, the storage manager has visibility into the fine-grain index structure, which optimizes transaction logging so that only the number of bytes actually changed need to be logged for any given index modification. The Berkeley DB provides ID list semantics, which are implemented by the storage manager. The Berkeley API was enhanced to support the insertion and deletion of individual IDs stored against a common key, with support for duplicate keys, and an optimized mechanism for the retrieval of the complete ID list for a given key.
The storage manager has direct knowledge of the application's intent when changes are made to ID lists, resulting in several improvements to ID list handling:
  • For long ID lists, the number of bytes written to the transaction log for any update to the list is significantly reduced, from the maximum ID list size (8 kilobytes) to twice the size of one ID (4 bytes).
  • For short ID lists, storage efficiency, and in most cases performance, is improved because only the storage manager meditate need to be stored, not the ID list metadata.
  • The average number of database pages marked as dirty per ID insert or delete operation is very small because a large number of duplicate keys will fit into each database page.

7.1.5.2. Search Performance

For each entry ID list, there is a size limit that is globally applied to all index keys managed by the server. In previous versions of Directory Server, this limit was called the All IDs Threshold. Because maintaining large ID lists in memory can affect performance, the All IDs Threshold set a limit on how large a single entry ID list could get. When a list hit a certain pre-determined size, the search would treat it as if the index contained the entire directory.
The difficulty in setting the All IDs Threshold hurt performance. If the threshold was too low, too many searches examined every entry in the directory. If it was too high, too many large ID lists had to be maintained in memory.
The problems addressed by the All IDs Threshold are no longer present because of the efficiency of entry insertion, modification, and deletion in the Berkeley DB design. The All IDs Threshold is removed for database write operations, and every ID list is now maintained accurately.
Since loading a long ID list from the database can significantly reduce search performance, the configuration parameter, nsslapd-idlistscanlimit, sets a limit on the number of IDs that are read before a key is considered to match the entire primary index. nsslapd-idlistscanlimit is analogous to the All IDs Threshold, but it only applies to the behavior of the server's search code, not the content of the database.
When the server uses indexes in the processing of a search operation, it is possible that one index key matches a large number of entries. For example, consider a search for objectclass=inetorgperson in a directory that contained one million inetorgperson entries. When the server reads the inetorgperson index key in the objectclass index, it finds one million matching entries. In cases like this, it is more efficient simply to conclude in the index lookup phase of the search operation processing that all the entries in the database match the query. This causes subsequent search processing to scan the entire database content, checking each entry as to whether it matches the search filter. The time required to fetch the index keys is not worthwhile; the search operation is likely to be processed more efficiently by omitting the index lookup.
In Directory Server, when examining an index, if more than a certain number of entries are found, the server stops reading the index and marks the search as unindexed for that particular index.
The threshold number of entries is called the idlistscanlimit and is configured with the nsslapd-idlistscanlimit configuration attribute. The default value is 4000, which is designed to give good performance for a common range of database sizes and access patterns. Typically, it is not necessary to change this value. However, in rare circumstances it may be possible to improve search performance with a different value. For example, lowering the value will improve performance for searches that will otherwise eventually hit the default limit of 4000. This might reduce performance for other searches that benefit from indexing. Conversely, increasing the limit could improve performance for searches that were previously hitting the limit. With a higher limit, these searches could benefit from indexing where previously they did not.
For more information on search limits for the server, refer to Section 7.1.3, “Overview of the Searching Algorithm”.

7.1.5.3. Backwards Compatibility and Migration

While current versions of Directory Server can support the old database design, only the new design is supported for this and later releases of Directory Server.
Upon startup, the server will read the database version from the DBVERSION file, which contains the text Netscape-ldbm/6.2 (old database version), Netscape-ldbm/7.1 (new database format), or bdb/4.2/libback-ldbm (new database format). If the file indicates that the old format is used, then the old code is selected for the database. Because the DBVERSION file stores everything per-backend, it is possible to have different database formats for different individual backends, but the old database format is not recommended.
All databases must be migrated to Directory Server 8.2 when the system is upgraded. Migration is supported for Directory Server 6.x versions, and for releases earlier than version 6.x, dump the databases be dumped, and install Directory Server fresh. Migrating databases is covered in the Directory Server Installation Guide.
Also, the index sizes can be larger than in older releases, so you may want to increase your database cache size. To reconfigure your cache size, look up the nsslapd-dbcachesize entry in the Directory Server Configuration and Command-Line Tool Reference.

7.1.6. Balancing the Benefits of Indexing

Before creating new indexes, balance the benefits of maintaining indexes against the costs.
  • Approximate indexes are not efficient for attributes commonly containing numbers, such as telephone numbers.
  • Substring indexes do not work for binary attributes.
  • Equality indexes should be avoided if the value is big (such as attributes intended to contain photographs or passwords containing encrypted data).
  • Maintaining indexes for attributes not commonly used in a search increases overhead without improving global searching performance.
  • Attributes that are not indexed can still be specified in search requests, although the search performance may be degraded significantly, depending on the type of search.
  • The more indexes you maintain, the more disk space you require.
Indexes can become very time-consuming. For example:
  1. The Directory Server receives an add or modify operation.
  2. The Directory Server examines the indexing attributes to determine whether an index is maintained for the attribute values.
  3. If the created attribute values are indexed, then the Directory Server generates the new index entries.
  4. Once the server completes the indexing, the actual attribute values are created according to the client request.
For example, the Directory Server adds the entry:
dn: cn=John Doe,ou=People,dc=example,dc=com
objectclass: top
objectClass: person
objectClass: orgperson
objectClass: inetorgperson
cn: John Doe
cn: John
sn: Doe
ou: Manufacturing
ou: people
telephoneNumber: 408 555 8834
description: Manufacturing lead for the Z238 line of widgets.
The Directory Server is maintaining the following indexes:
  • Equality, approximate, and substring indexes for cn (common name) and sn (surname) attributes.
  • Equality and substring indexes for the telephone number attribute.
  • Substring indexes for the description attribute.
When adding that entry to the directory, the Directory Server must perform these steps:
  1. Create the cn equality index entry for John and John Doe.
  2. Create the appropriate cn approximate index entries for John and John Doe.
  3. Create the appropriate cn substring index entries for John and John Doe.
  4. Create the sn equality index entry for Doe.
  5. Create the appropriate sn approximate index entry for Doe.
  6. Create the appropriate sn substring index entries for Doe.
  7. Create the telephone number equality index entry for 408 555 8834.
  8. Create the appropriate telephone number substring index entries for 408 555 8834.
  9. Create the appropriate description substring index entries for Manufacturing lead for the Z238 line of widgets. A large number of substring entries are generated for this string.
As this example shows, the number of actions required to create and maintain databases for a large directory can be resource-intensive.