15.3. Custom Text Extractors

15.3.1. The Text Extraction Framework

A text extractor is a plain old Java object (POJO). To create one, write a Java class that extends the abstract class TextExtractor:
package org.modeshape.jcr.api.text;

import java.io.IOException;
import java.io.InputStream;
import javax.jcr.Node;
import javax.jcr.Property;
import javax.jcr.RepositoryException;
import org.modeshape.jcr.api.Binary;

public abstract class TextExtractor {

    ...

    /**
     * Determine if this extractor is capable of processing content with the supplied MIME type.
     * @param mimeType the MIME type; never null
     * @return true if this extractor can process content with the supplied MIME type, or false otherwise.
     */
    public abstract boolean supportsMimeType( String mimeType );

    /**
     * Extract text from the given {@link Binary}, using the given output to record the results.
     * @param binary the binary value that can be used in the extraction process; never <code>null</code>
     * @param output the output from the sequencing operation; never <code>null</code>
     * @param context the context for the sequencing operation; never <code>null</code>
     * @throws Exception if there is a problem during the extraction process
     */
    public abstract void extractFrom( Binary binary,
                                      TextExtractor.Output output,
                                      Context context ) throws Exception;

    /**
     * Allows subclasses to process the stream of binary value property in "safe" fashion, making sure the stream is closed at the
     * end of the operation.
     * @param binary a {@link org.modeshape.jcr.api.Binary} which is expected to contain a non-null binary value.
     * @param operation a {@link org.modeshape.jcr.api.text.TextExtractor.BinaryOperation} which should work with the stream
     * @param <T> the return type of the binary operation
     * @return whatever type of result the stream operation returns
     * @throws Exception if there is an error processing the stream
     */
    protected final <T> T processStream( Binary binary,
                                         BinaryOperation<T> operation ) throws Exception {
        ...
    }

    /**
     * Interface which can be used by subclasses to process the input stream of a binary property.
     * @param <T> the return type of the binary operation
     */
    protected interface BinaryOperation<T> {
        T execute( InputStream stream ) throws Exception;
    }

    /**
     * Interface which provides additional information to the text extractors, during the extraction operation.
     */
    public interface Context {
        String mimeTypeOf( String name,
                           Binary binaryValue ) throws RepositoryException, IOException;
    }

    /**
     * The interface passed to a TextExtractor to which the extractor should record all text content.
     */
    public interface Output {
        /**
         * Record the text as being extracted. This method can be called multiple times during a single extract.
         * @param text the text extracted from the content.
         */
        void recordText( String text );
    }
}
The abstract class also contains fields and getters (not shown above) for the name and logger that are automatically set by the hierarchical database during repository initialization.
There are two abstract methods that must be implemented: supportsMimeType(...) and extractFrom(...). The first returns true for every MIME type that the extractor can process. The extractFrom method is the core of the implementation: it should process the BINARY value's contents and write the searchable text to the supplied Output object.
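To make the division of labor between the two methods concrete, here is a minimal, self-contained sketch of a plain-text extractor. It does not depend on the hierarchical database: SketchOutput is a hypothetical stand-in for TextExtractor.Output, PlainTextExtractorSketch is a hypothetical class name, and the extraction method takes a raw InputStream instead of a Binary. The sketch also assumes UTF-8 content and skips blank lines, neither of which the framework requires.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;

// Hypothetical stand-in for TextExtractor.Output, so the sketch compiles on its own.
interface SketchOutput {
    void recordText( String text );
}

// Hypothetical extractor. A real one would extend TextExtractor and
// receive a Binary and a Context rather than a raw InputStream.
class PlainTextExtractorSketch {

    // Mirrors supportsMimeType(...): accept only the MIME types this extractor handles.
    public boolean supportsMimeType( String mimeType ) {
        return "text/plain".equalsIgnoreCase(mimeType);
    }

    // Mirrors the core of extractFrom(...): read the content and record
    // the searchable text, here one non-blank line at a time.
    public void extractFrom( InputStream stream,
                             SketchOutput output ) throws IOException {
        // Assumption: the content is UTF-8 encoded.
        try (BufferedReader reader = new BufferedReader(
                 new InputStreamReader(stream, StandardCharsets.UTF_8))) {
            String line;
            while ((line = reader.readLine()) != null) {
                if (!line.trim().isEmpty()) {
                    output.recordText(line);
                }
            }
        }
    }
}
```

Because recordText can be called multiple times during a single extraction, the extractor does not need to buffer the whole document before recording it.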
Note that processStream(...) is a utility method that extractFrom can call. It opens the BINARY value's stream, passes the stream to your operation, and ensures that the stream is always closed. Your implementation can therefore implement the extractFrom method as follows:
public void extractFrom( final Binary binary,
                         final TextExtractor.Output output,
                         final Context context ) throws Exception {
    processStream(binary, new BinaryOperation<Object>() {
        @Override
        public Object execute( InputStream stream ) throws Exception {
            // Custom logic to read the stream and write to 'output'
            return null;
        }
    });
}
This can make your implementation a little easier, but you are free to implement the extractFrom method so that it processes the stream directly.
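The difference between the two approaches can be illustrated with a self-contained sketch. It is not the actual implementation of processStream(...): StreamHandlingSketch, StreamSource, StreamOperation, withStream, and countBytesDirectly are all hypothetical names, and StreamSource stands in for obtaining the stream from a Binary. The first method shows how a helper can take responsibility for closing the stream; the second shows what you take on yourself when you process the stream directly.

```java
import java.io.IOException;
import java.io.InputStream;

class StreamHandlingSketch {

    // Hypothetical stand-in for whatever opens the BINARY value's stream.
    interface StreamSource {
        InputStream open() throws Exception;
    }

    // Hypothetical stand-in for TextExtractor.BinaryOperation<T>.
    interface StreamOperation<T> {
        T execute( InputStream stream ) throws Exception;
    }

    // Helper in the spirit of processStream(...): open the stream, run the
    // operation, and close the stream even if the operation throws.
    static <T> T withStream( StreamSource source,
                             StreamOperation<T> operation ) throws Exception {
        try (InputStream stream = source.open()) {
            return operation.execute(stream);
        }
    }

    // Direct processing: no helper, so this method must close the stream itself.
    static int countBytesDirectly( InputStream source ) throws IOException {
        try (InputStream stream = source) {
            int count = 0;
            while (stream.read() != -1) {
                count++;
            }
            return count;
        }
    }
}
```

In both cases try-with-resources guarantees that the stream is closed. The helper simply keeps that boilerplate out of each extractor, which is the convenience the utility method offers.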