Chapter 12. JPA Connector

12.1. The JPA Connector

This connector stores a graph of any structure or size in a relational database, using a JPA provider on top of a JDBC driver. Currently this connector relies upon some Hibernate-specific capabilities. The schema of the database is dictated by this connector and is optimized for storing a graph structure. (In other words, this connector does not expose as a graph the data in an existing database with an arbitrary schema.)

12.2. JPA Connector Properties

The JpaSource class provides a number of JavaBean properties that control its behavior. For more information about these properties, refer to org.modeshape.connector.store.jpa.JpaSource in the Data Services JavaDoc.

12.3. Configuring a JPA Connector

One way to configure the JPA connector is to create a JcrConfiguration instance with a repository source that uses the JpaSource class. For example:
JcrConfiguration config = ...
config.repositorySource("JPA Store")
      .usingClass(JpaSource.class)
      .setDescription("The database store for our content")
      .setProperty("dataSourceJndiName", "java:/MyDataSource")
      .setProperty("defaultWorkspaceName", "My Default Workspace")
      .setProperty("autoGenerateSchema", "validate");
Of course, setting other more advanced properties would entail calling setProperty(...) for each. Since almost all of the properties have acceptable default values, however, we don't need to set very many of them.
Another way to configure the JPA connector is to create a JcrConfiguration instance and load an XML configuration file that contains a repository source that uses the JpaSource class. For example, a file named configRepository.xml can be created with these contents:
<?xml version="1.0" encoding="UTF-8"?>
<configuration xmlns:mode="http://www.modeshape.org/1.0" xmlns:jcr="http://www.jcp.org/jcr/1.0">
    <!-- 
    Define the sources for the content.  These sources are directly accessible using the 
    ModeShape-specific Graph API.  In fact, this is how the ModeShape JCR implementation works.  You 
    can think of these as being similar to JDBC DataSource objects, except that they expose 
    graph content via the Graph API instead of records via SQL or JDBC. 
    -->
    <mode:sources jcr:primaryType="nt:unstructured">
        <!-- 
        The 'JPA Store' repository is a JPA source with a single default workspace (though 
        others could be created, too).
        -->
        <mode:source jcr:name="JPA Store" 
                    mode:classname="org.modeshape.connector.store.jpa.JpaSource"
                    mode:description="The database store for our content"
                    mode:dataSourceJndiName="java:/MyDataSource"
                    mode:defaultWorkspaceName="default"
                    mode:autoGenerateSchema="validate"/>    
    </mode:sources>
    
    <!-- MIME type detectors and JCR repositories would be defined below --> 
</configuration>
The configuration can then be loaded from Java like this:
JcrConfiguration config = new JcrConfiguration().loadFrom("/configRepository.xml");

12.4. Simple Model

This database schema model stores node properties as opaque records in the same row as transparent values like the node's namespace, local name, and same-name-sibling index. Large property values are stored separately.
The set of tables used in this model includes:
  • Workspaces - the set of workspaces and their names.
  • Namespaces - the set of namespace URIs used in paths, property names, and property values.
  • Nodes - the nodes in the repository, where each node and its properties are represented by a single record. This approach makes it possible to efficiently work with nodes containing large numbers of children, where adding and removing child nodes is largely independent of the number of children. Since the primary consumer of ModeShape graph information is the JCR layer, and the JCR layer always retrieves the properties of the nodes it retrieves, the properties have been moved in-row with the nodes. Properties are still stored in an opaque, serialized (and optionally compressed) form.
  • Large values - property values larger than a certain size will be broken out into this table, where they are tracked by their SHA-1 hash and shared by all properties that have that same value. The values are stored in a binary (and optionally compressed) form.
  • Subgraph - a working area for efficiently computing the space of a subgraph; see below
  • Options - the parameters for this store's configuration (common to all models)
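The sharing of large values by SHA-1 hash can be pictured with a small Java sketch. This is only an illustration under stated assumptions, not ModeShape's implementation: the LargeValueSketch class and its methods are hypothetical, and an in-memory map stands in for the large values table.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;
import java.util.HashMap;
import java.util.Map;

// Hypothetical illustration; this class is not part of ModeShape.
class LargeValueSketch {

    // Stands in for the large values table: SHA-1 hash (hex) -> binary value.
    private final Map<String, byte[]> largeValues = new HashMap<>();

    // Computes the SHA-1 hash of a value as a lowercase hex string.
    static String sha1Hex(byte[] value) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            byte[] digest = md.digest(value);
            StringBuilder hex = new StringBuilder(digest.length * 2);
            for (byte b : digest) {
                hex.append(String.format("%02x", b));
            }
            return hex.toString();
        } catch (NoSuchAlgorithmException e) {
            // SHA-1 is a required algorithm on every Java platform.
            throw new IllegalStateException(e);
        }
    }

    // Stores the value once per distinct hash and returns the hash, which a
    // property would keep as its reference to the shared value.
    String store(byte[] value) {
        String key = sha1Hex(value);
        largeValues.putIfAbsent(key, value.clone());
        return key;
    }

    // Number of distinct values currently held.
    int size() {
        return largeValues.size();
    }

    public static void main(String[] args) {
        LargeValueSketch store = new LargeValueSketch();
        String first = store.store("abc".getBytes(StandardCharsets.UTF_8));
        String second = store.store("abc".getBytes(StandardCharsets.UTF_8));
        // Both properties refer to the same stored value.
        System.out.println(first.equals(second) + " " + store.size());
    }
}
```

Two properties holding identical content produce the same key, so the value is stored only once however many properties refer to it.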
This database model contains two tables that are used in an efficient mechanism to find all of the nodes in the subgraph below a certain node. This process starts by creating a record for the subgraph query, and then proceeds by executing a join to find all the children of the top-level node, and inserting them into the database (in a working area associated with the subgraph query). Then, another join finds all the children of those children and inserts them into the same working area. This continues until the maximum depth has been reached, or until there are no more children (whichever comes first). All of the nodes in the subgraph are then represented by records in the working area, and can be used to work with the subgraph nodes quickly and efficiently. When finished, the mechanism deletes the records in the working area associated with the subgraph query.
This subgraph query mechanism is extremely efficient: it performs one join/insert statement per level of the subgraph, so the number of statements depends on the depth of the subgraph rather than on the number of nodes in it. For example, consider a subgraph of node A, where A has 10 children, and each child contains 10 children, and each grandchild contains 10 children. This subgraph has a total of 1111 nodes (1 root + 10 children + 10*10 grandchildren + 10*10*10 great-grandchildren). Finding the nodes in this subgraph would normally require 1 query per node (in other words, 1111 queries). But with this subgraph query mechanism, all of the nodes in the subgraph can be found with 1 insert plus 4 additional join/inserts.
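The level-by-level expansion can be simulated in plain Java to check the arithmetic of this example. The sketch below is not the connector's code: SubgraphSketch and its methods are hypothetical names, an in-memory map stands in for the node table, and each pass of the loop stands in for one join/insert statement.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical illustration; this class is not part of ModeShape.
class SubgraphSketch {

    // Outcome of a simulated subgraph query.
    static final class Result {
        final Set<String> workingArea; // every node found, including the root
        final int joinInserts;         // simulated join/insert statements

        Result(Set<String> workingArea, int joinInserts) {
            this.workingArea = workingArea;
            this.joinInserts = joinInserts;
        }
    }

    // Expands the subgraph one level per pass, stopping at maxDepth or when
    // a pass finds no more children (whichever comes first).
    static Result findSubgraph(Map<String, List<String>> childrenByParent,
                               String root, int maxDepth) {
        Set<String> workingArea = new LinkedHashSet<>();
        workingArea.add(root); // the initial insert for the subgraph query
        List<String> frontier = Collections.singletonList(root);
        int joinInserts = 0;
        for (int depth = 1; depth <= maxDepth && !frontier.isEmpty(); depth++) {
            // One simulated join/insert: find the children of every node
            // added by the previous pass and add them to the working area.
            List<String> next = new ArrayList<>();
            for (String parent : frontier) {
                next.addAll(childrenByParent.getOrDefault(parent, Collections.emptyList()));
            }
            joinInserts++;
            workingArea.addAll(next);
            frontier = next;
        }
        return new Result(workingArea, joinInserts);
    }

    // Builds a tree in which every node down to the given depth has 'fanout' children.
    static Map<String, List<String>> uniformTree(String root, int fanout, int depth) {
        Map<String, List<String>> childrenByParent = new HashMap<>();
        List<String> level = Collections.singletonList(root);
        for (int d = 0; d < depth; d++) {
            List<String> nextLevel = new ArrayList<>();
            for (String parent : level) {
                List<String> kids = new ArrayList<>();
                for (int i = 0; i < fanout; i++) {
                    kids.add(parent + "/" + i);
                }
                childrenByParent.put(parent, kids);
                nextLevel.addAll(kids);
            }
            level = nextLevel;
        }
        return childrenByParent;
    }

    public static void main(String[] args) {
        Result r = findSubgraph(uniformTree("A", 10, 3), "A", Integer.MAX_VALUE);
        // prints "1111 nodes, 4 join/inserts"
        System.out.println(r.workingArea.size() + " nodes, " + r.joinInserts + " join/inserts");
    }
}
```

In this simulation the fourth pass finds no children, which is what ends the loop when no maximum depth applies. With a maximum depth of 2, only two passes run and the working area holds 111 nodes.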
This mechanism has the added benefit that the set of nodes in the subgraph are kept in a working area in the database, meaning they don't have to be pulled into memory.
In the Simple model, subgraph queries are used to efficiently process a number of different requests, including ReadBranchRequest and DeleteBranchRequest. Processing each of these kinds of requests requires knowledge of the subgraph, and in fact all but the ReadBranchRequest need to know the complete subgraph.

Warning

Most database management systems have built-in sizes for LOB columns (although many allow database administrators to control the size), and thus do not require any special consideration. However, Apache Derby and IBM DB2 require explicit sizes on LOB columns. Currently, the ModeShape database schema has two such columns: MODE_SIMPLE_NODE.DATA and MODE_LARGE_VALUES.DATA. The default sizes of these columns are large (1MB and 1GB, respectively), but attempts to store values larger than these sizes will fail.
Therefore, when using IBM DB2 or Apache Derby, determine the appropriate size of these columns for your environment. For production systems, ModeShape recommends using the DDL generation utility (provided with ModeShape, see above) to generate the DDL for your particular DBMS. It is then easy to adjust the generated file to specify alternative sizes for the two columns. Alternatively, database administrators can alter the two tables by increasing the size of these columns.
Other databases do not seem to be affected by this issue.