Chapter 20. Text Sequencers

20.1. Text Sequencers

Text sequencers extract data from text streams. There are separate sequencers for character-delimited and fixed-width sequencing, but both treat the incoming text stream as a series of rows (separated by line terminators, as defined in BufferedReader.readLine()), with each row consisting of one or more columns. Each text sequencer provides its own mechanism for splitting a row into columns.
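
The row-splitting rule described above can be illustrated with a minimal sketch that relies only on the JDK. The class name RowReaderSketch and the method readRows are hypothetical and are not part of ModeShape; the sketch only shows how BufferedReader.readLine() treats "\n", "\r\n", and "\r" as row terminators.

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Illustrative only: RowReaderSketch is a hypothetical name, not a ModeShape class.
final class RowReaderSketch {

    // Returns one entry per row, using the line-terminator rules of
    // BufferedReader.readLine(). Terminators are not included in the rows.
    static List<String> readRows(String text) {
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                rows.add(line);
            }
        } catch (IOException e) {
            // Not expected for an in-memory reader; rethrown unchecked for brevity.
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        // Three different terminators, four rows.
        System.out.println(readRows("a,b\nc,d\r\ne,f\rg,h"));
    }
}
```

Each returned row would then be passed to the sequencer-specific column splitter.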

20.2. Abstract Text Sequencer

When using the AbstractTextSequencer, the default row factory creates one node in the output location for each row sequenced from the source and adds each column within the row as a child node of the row node. The output graph takes the following form (all nodes have primary type nt:unstructured):
 <graph root jcr:mixinTypes = mode:derived, 
                mode:derivedAt="2011-05-13T13:12:03.925Z", 
                mode:derivedFrom="/files/foo.dat">
     + text:row[1]
     |   + text:column[1] (jcr:mixinTypes = text:column, text:data = <column1 data>)
     |   + ...
     |   + text:column[n] (jcr:mixinTypes = text:column, text:data = <columnN data>)
     + ...
     + text:row[m]
         + text:column[1] (jcr:mixinTypes = text:column, text:data = <column1 data>)
         + ...
         + text:column[n] (jcr:mixinTypes = text:column, text:data = <columnN data>)

20.3. Abstract Text Sequencer Properties

For information about configurable properties relating to the Abstract Text Sequencer, refer to the org.modeshape.sequencer.text.AbstractTextSequencer class in the Data Services JavaDoc.

20.4. Delimited Text Sequencer

The DelimitedTextSequencer splits rows into columns based on a regular expression pattern. Although the default pattern is a comma, any regular expression can be provided, allowing for more sophisticated splitting patterns.
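
As a rough illustration of regular-expression splitting, the following sketch uses java.util.regex.Pattern directly. DelimitedSplitSketch and splitRow are hypothetical names, and the sequencer's own implementation may differ in details such as how trailing empty columns are handled.

```java
import java.util.Arrays;
import java.util.List;
import java.util.regex.Pattern;

// Illustrative only: DelimitedSplitSketch is a hypothetical name, not a ModeShape class.
final class DelimitedSplitSketch {

    // Splits one row into columns using the given regular expression.
    // Note: Pattern.split(CharSequence) drops trailing empty strings.
    static List<String> splitRow(String row, String splitPattern) {
        return Arrays.asList(Pattern.compile(splitPattern).split(row));
    }

    public static void main(String[] args) {
        // Default-style pattern: a plain comma.
        System.out.println(splitRow("a,b,c", ","));
        // A more elaborate pattern: semicolon with optional surrounding whitespace.
        System.out.println(splitRow("a ; b;c", "\\s*;\\s*"));
        // Regex metacharacters such as the pipe must be escaped to match literally.
        System.out.println(splitRow("a|b|c", "\\|"));
    }
}
```

Because the pattern is a regular expression, delimiters that are also metacharacters (for example "|", "." or "*") need to be escaped, or quoted with Pattern.quote(), to be matched literally.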

20.5. Delimited Text Sequencer Properties

For information about configurable properties relating to the Delimited Text Sequencer, refer to the org.modeshape.sequencer.text.DelimitedTextSequencer class in the Data Services JavaDoc.

20.6. Configuring a Delimited Text Sequencer

To use this sequencer, include the modeshape-sequencer-text JAR in your application and configure the JcrConfiguration with something similar to the following:
JcrConfiguration config = ...

config.sequencer("Delimited Text Sequencer")
      .usingClass("org.modeshape.sequencer.text.DelimitedTextSequencer")
      .loadedFromClasspath()
      .setDescription("Sequences delimited files to extract values")
      .sequencingFrom("//(*.(txt)[*])/jcr:content[@jcr:data]")
      // splitPattern is a regular expression, so a literal pipe must be escaped
      .setProperty("splitPattern", "\\|")
      .andOutputtingTo("/txt/$1");

20.7. Fixed Width Text Sequencer

The FixedWidthTextSequencer splits rows into columns based on predefined positions. The default setting is to have a single column per row.

20.8. Fixed Width Text Sequencer Properties

For information about configurable properties relating to the Fixed Width Text Sequencer, refer to the org.modeshape.sequencer.text.FixedWidthTextSequencer class in the Data Services JavaDoc.

20.9. Configuring a Fixed Width Text Sequencer

  1. Include the relevant libraries

    Include modeshape-sequencer-text-VERSION.jar in your application.
  2. Choose one of the following for sequencing configuration

    • Define the sequencing configuration based on the standard example provided in SOA-ROOT/eds/modeshape/resources/modeshape-config-standard.xml:
      <mode:sequencer jcr:name="Fixed Width Text File Sequencer" mode:classname="org.modeshape.sequencer.text.FixedWidthTextSequencer">
        <mode:description>
          Sequences *.txt fixed-width text files loaded under '/files', splitting rows into columns based on predefined positions.
        </mode:description>
        <mode:pathExpression>
          eds-store:default:/files//(*.txt[*])/jcr:content[@jcr:data] => eds-store:default:/sequenced/text/fixedWidth/$1
        </mode:pathExpression>
        <mode:columnStartPositions/>
      </mode:sequencer>
      

      Note

      The columnStartPositions property defines the 0-based column start positions. Everything before the first start position is treated as the first column. The default value is the empty string (implying that each row should be treated as a single column). There is an implicit column start position of 0 that never needs to be specified.
    • Configure via org.modeshape.jcr.JcrConfiguration:
      JcrConfiguration config = ...
      
      config.sequencer("Fixed Width Text Sequencer")
            .usingClass("org.modeshape.sequencer.text.FixedWidthTextSequencer")
            .loadedFromClasspath()
            .setDescription("Sequences *.txt fixed-width text files loaded under '/files', splitting rows into columns based on predefined positions.")
            .sequencingFrom("/files//(*.txt[*])/jcr:content[@jcr:data]")
            .setProperty("columnStartPositions", "3,6,15")
            .andOutputtingTo("/sequenced/text/fixedWidth/$1");

    Note

    Refer to SOA-ROOT/eds/modeshape/resources/modeshape-config-standard.xml for more information.
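
The effect of the columnStartPositions property can be approximated with the following sketch, which uses only the JDK. FixedWidthSplitSketch and splitRow are hypothetical names, not ModeShape classes. The sketch assumes the start positions are positive and strictly ascending, and it assumes that a row shorter than a start position simply yields fewer columns; the sequencer's actual handling of short rows may differ.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative only: FixedWidthSplitSketch is a hypothetical name, not a ModeShape class.
final class FixedWidthSplitSketch {

    // Splits a row at the given 0-based start positions. Position 0 is implicit,
    // so everything before the first start position becomes the first column.
    static List<String> splitRow(String row, int... columnStartPositions) {
        List<String> columns = new ArrayList<>();
        int start = 0; // implicit start of the first column
        for (int next : columnStartPositions) {
            if (next >= row.length()) {
                break; // assumption: short rows yield fewer columns
            }
            columns.add(row.substring(start, next));
            start = next;
        }
        columns.add(row.substring(start)); // last column runs to the end of the row
        return columns;
    }

    public static void main(String[] args) {
        // Equivalent to columnStartPositions = "3,6,15"
        System.out.println(splitRow("abcdefghijklmnopqrst", 3, 6, 15));
        // No start positions: the whole row is a single column
        System.out.println(splitRow("abcdefghijklmnopqrst"));
    }
}
```

With start positions 3, 6, and 15, a 20-character row is divided into columns covering character ranges 0-2, 3-5, 6-14, and 15-19.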