9.3. Text File Sequencer

Text sequencers extract data from text streams. There are separate sequencers for character-delimited and fixed-width sequencing, but both treat the incoming text stream as a series of rows (separated by line terminators, as defined in BufferedReader.readLine()), with each row consisting of one or more columns. Each text sequencer provides its own mechanism for splitting a row into columns.

9.3.1. Abstract Text Sequencer

When using the Abstract Text Sequencer, the default row factory creates one node in the output location for each row sequenced from the source and adds each column within the row as a child node of the row node. The output graph takes the following form (all nodes have primary type nt:unstructured):
<graph root jcr:mixinTypes = mode:derived,
                mode:derivedAt="2011-05-13T13:12:03.925Z",
                mode:derivedFrom="/files/foo.dat">
     + text:row[1]
     |   + text:column[1] (jcr:mixinTypes = text:column, text:data = <column1 data>)
     |   + ...
     |   + text:column[n] (jcr:mixinTypes = text:column, text:data = <columnN data>)
     + ...
     + text:row[m]
         + text:column[1] (jcr:mixinTypes = text:column, text:data = <column1 data>)
         + ...
         + text:column[n] (jcr:mixinTypes = text:column, text:data = <columnN data>)

9.3.2. Abstract Text Sequencer Properties

The AbstractTextSequencer class provides a number of JavaBean properties that are common to both of the concrete text sequencer classes:

Table 9.1. Abstract Text Sequencer Properties

| Property | Description |
|---|---|
| commentMarker | Optional property that, if set, indicates that any line beginning with exactly this string should be treated as a comment and should not be processed further. If this value is null, then all lines will be sequenced. The default value for this property is null. |
| maximumLinesToRead | Optional property that, if set, limits the number of lines that will be read during sequencing. Additional lines will be ignored. If this value is non-positive, all lines will be read and sequenced. Comment lines are not counted towards this total. The default value of this property is -1 (indicating that all lines should be read and sequenced). |
| rowFactoryClassName | Optional property that, if set, provides the fully qualified name of a class that provides a custom implementation of the RowFactory interface. This class must have a no-argument, public constructor. If set, an instance of this class will be created each time that the sequencer sequences an input stream and will be used to provide the output structure of the graph. If this property is set to null, a default implementation will be used. The default value of this property is null. |
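
As an illustration of how commentMarker and maximumLinesToRead interact, the following is a minimal Java sketch. It is not the sequencer's actual implementation: the class name LineFilterSketch and the method selectRows are hypothetical, and the sketch assumes only the behavior described in the table (lines beginning with the marker are skipped and not counted, and a non-positive limit reads every line).

```java
import java.io.BufferedReader;
import java.io.IOException;
import java.io.StringReader;
import java.io.UncheckedIOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, for illustration only; not part of the sequencer API.
final class LineFilterSketch {

    // Returns the lines that remain after applying the comment and line-limit rules.
    static List<String> selectRows(String text, String commentMarker, int maximumLinesToRead) {
        List<String> rows = new ArrayList<>();
        try (BufferedReader reader = new BufferedReader(new StringReader(text))) {
            String line;
            while ((line = reader.readLine()) != null) {
                // A null marker disables comment handling entirely.
                if (commentMarker != null && line.startsWith(commentMarker)) {
                    continue; // comment lines are skipped and not counted
                }
                // A non-positive limit means "read everything".
                if (maximumLinesToRead > 0 && rows.size() >= maximumLinesToRead) {
                    break; // additional lines are ignored
                }
                rows.add(line);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return rows;
    }

    public static void main(String[] args) {
        String text = "# header comment\na,b\n# another comment\nc,d\ne,f\n";
        System.out.println(selectRows(text, "#", 2)); // prints [a,b, c,d]
    }
}
```

With the marker set to "#" and a limit of 2, the two comment lines do not count towards the limit, so the first two data lines are kept and the third is ignored.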

9.3.3. Delimited Text Sequencer

The Delimited Text Sequencer splits rows into columns based on a regular expression pattern. Although the default pattern is a comma, any regular expression can be provided, allowing for more sophisticated splitting patterns.

9.3.4. Delimited Text Sequencer Properties

The DelimitedTextSequencer class provides an additional JavaBean property to override the default regular expression pattern:

Table 9.2. DelimitedTextSequencer Properties

| Property | Description |
|---|---|
| splitPattern | Optional property that, if set, sets the regular expression pattern that is used to split each row into columns. This property may not be set to null and defaults to ",". |
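
To show what a regular expression split pattern does to a row, here is a minimal Java sketch using java.util.regex.Pattern. The class SplitPatternSketch and the method splitRow are hypothetical names used for illustration, and the sketch does not claim to reproduce the sequencer's internals. Note that Pattern.split(CharSequence) drops trailing empty strings; whether the sequencer treats trailing empty columns the same way is an assumption.

```java
import java.util.Arrays;
import java.util.regex.Pattern;

// Hypothetical helper, for illustration only; not part of the sequencer API.
final class SplitPatternSketch {

    // Splits one row into columns using the given regular expression.
    static String[] splitRow(String row, String splitPattern) {
        return Pattern.compile(splitPattern).split(row);
    }

    public static void main(String[] args) {
        // Default pattern: a single comma.
        System.out.println(Arrays.toString(splitRow("a,b,c", ",")));              // prints [a, b, c]
        // A pipe delimiter with optional surrounding whitespace.
        System.out.println(Arrays.toString(splitRow("a | b|c", "\\s*\\|\\s*")));  // prints [a, b, c]
        // Either a semicolon or a comma acts as the delimiter.
        System.out.println(Arrays.toString(splitRow("x;y,z", "[;,]")));           // prints [x, y, z]
    }
}
```

The backslashes are doubled here only because the patterns are written as Java string literals.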

9.3.5. Using the Delimited Text Sequencer

To use the Delimited Text Sequencer, include the modeshape-sequencer-text JAR in your application and configure the repository to use this sequencer using something similar to:
{
    "name" : "Text Sequencers Test Repository",
    "sequencing" : {
        "removeDerivedContentWithOriginal" : true,
        "sequencers" : [
            {
                "name" : "Delimited text sequencer",
                "classname" : "delimitedtext",
                "pathExpression" : "default:/(*.csv)/jcr:content[@jcr:data] => /delimited",
                "commentMarker" : "#"
            }
        ]
    }
}

9.3.6. Fixed Width Text Sequencer

The Fixed Width Text Sequencer splits rows into columns based on predefined positions. The default setting is to have a single column per row.

9.3.7. Fixed Width Text Sequencer Properties

The FixedWidthTextSequencer class provides an additional JavaBean property to override the default start positions for each column.

Table 9.3. FixedWidthTextSequencer Properties

| Property | Description |
|---|---|
| columnStartPositions | Optional property that, if set, specifies an array of integers where each value represents the start position of each column after the first (the start position for the first column never needs to be specified, since it is always '0'). The default value is an empty array, implying that each row should be treated as a single column. This property may not be set to null. |
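
The effect of columnStartPositions can be sketched in a few lines of Java. This is an illustrative approximation rather than the sequencer's own code: FixedWidthSketch and splitRow are hypothetical names, and the handling of rows shorter than a start position (producing empty columns) is an assumption.

```java
import java.util.Arrays;

// Hypothetical helper, for illustration only; not part of the sequencer API.
final class FixedWidthSketch {

    // Slices a row into columns; the first column always starts at position 0.
    static String[] splitRow(String row, int[] columnStartPositions) {
        String[] columns = new String[columnStartPositions.length + 1];
        int start = 0;
        for (int i = 0; i < columns.length; i++) {
            // The last column runs to the end of the row.
            int end = (i < columnStartPositions.length) ? columnStartPositions[i] : row.length();
            // Clamp both bounds so short rows yield empty columns (assumed behavior).
            int from = Math.min(start, row.length());
            int to = Math.min(Math.max(end, from), row.length());
            columns[i] = row.substring(from, to);
            start = end;
        }
        return columns;
    }

    public static void main(String[] args) {
        // Start positions 3 and 6 give three columns: [0,3), [3,6), [6,end).
        System.out.println(Arrays.toString(splitRow("abcdefghi", new int[] {3, 6}))); // prints [abc, def, ghi]
        // An empty array treats the whole row as a single column.
        System.out.println(Arrays.toString(splitRow("abcdefghi", new int[] {})));     // prints [abcdefghi]
    }
}
```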

9.3.8. Using the Fixed Width Text Sequencer

To use the Fixed Width Text Sequencer, include the modeshape-sequencer-text JAR in your application and configure the repository to use this sequencer using something similar to:
{
    "name" : "Text Sequencers Test Repository",
    "sequencing" : {
        "removeDerivedContentWithOriginal" : true,
        "sequencers" : {
            "Fixed Width Text Sequencer" : {
                "classname" : "fixedwidthtext",
                "pathExpressions" : [ "default:/(*.txt)/jcr:content[@jcr:data] => /fixed" ],
                "columnStartPositions" : [3,6],
                "commentMarker" : "#"
            }
        }
    }
}