Red Hat Training

A Red Hat training course is available for Red Hat JBoss Data Virtualization

3.2. Federation

Previously, the hierarchical database owned all of its own data. It stored all of the information about all nodes within an Infinispan cache, and the repository had a single binary store used to persist all BINARY values.
Conventional Repository

Figure 3.1. Conventional Repository

However, in this release, the hierarchical database provides the ability for clients to access data stored both internally and externally using the single JCR API.
Federated Repository

Figure 3.2. Federated Repository

An external system is a system outside of the hierarchical database that owns its own data and that the hierarchical database interacts with to access (and optionally update) that data. The external system might be a data store, or it might be a service that dynamically produces data. Examples of external systems are Oracle 11i, Cassandra, MongoDB, Git, SVN, SAP, file systems, CMIS, RPM repositories, and JCR repositories.
Whereas an external system is a kind of software system, we use the term external source to describe an addressable instance or installation of the external system. For example, external sources might include a particular database instance, a particular Git repository, a particular file system on a specific machine, or a particular instance of a CMIS repository.
In the diagram above, two external sources are shown and labeled "External Source A" and "External Source B". (But the diagram does not define what kind of system they are.)
A hierarchical database connector is the software used to interact with a specific kind of external system. A connector consists of compiled Java classes and resources, and is usually packaged as a JAR with dependencies on 3rd party libraries. The hierarchical database defines a connector SPI (or Service Provider Interface) which the connector must implement. Generally connectors can read and update data in the external system, although a connector implementation may support only read operations.
To be useful, however, a connector must be instantiated and that instance configured to talk to a specific external source . Then that connector instance's job is to create a single, virtual tree of nodes that represents the data in the external source. Note that the connector does not create the entire tree up front; instead, the connector creates the nodes in that virtual tree only when the hierarchical database asks for them. Thus, the potential tree of nodes for a given source might be massive, but only the nodes being used will be materialized.
The diagram of the federated repository shown above includes two connector instances , each of which is configured to talk to one of the external sources.
An internal node is any node within a hierarchical database repository that is owned by the database and stored within the Infinispan cache. In a regular repository (without federation), all nodes are internal nodes.
An external node is any node within a federated hierarchical database repository that is not owned by the database but instead is dynamically generated to represent some portion of data in an external source. Clients view internal and external nodes in exactly the same way, but internally they are handled in different ways.
A federated node is an internal node that contains some children that are external nodes. In other words, only federated nodes can have internal nodes and external nodes as children, whereas internal nodes can only have other internal nodes as children and external nodes can only have other external nodes as children.
A projection is a portion of the repository (really a subgraph) whose nodes are all external nodes that are representations of some of the data in an external source. The nodes are dynamically generated (by the connector's logic) as needed, and can optionally be cached for a configurable amount of time.
The federated repository diagram above shows three projections, labeled "Projection 1", "Projection 2", and "Projection 3". Strictly speaking, projections do not have a name, so the labels are merely for discussion purposes. Note how projections 1 and 2 both project external nodes from "External Source A", whereas projection 3 only projects the external nodes from "External Source B". We often will talk about an external source as having one or more projections; thus "External Source A" has two projections ("Projection 1" and "Projection 2"), while "External Source B" has only one projection ("Projection 3").
Each projection maps a specific subtree of the virtual tree (created by a connector talking to an external source) underneath a specific federated node. A simple path is used to identify the subtree of external nodes, and a simple path is used to identify the federated node. The hierarchical database uses a projection expression that is a string with these two paths:
<workspace-name>':' <path-to-federated-node> '=>' <path-in-external-source-of-node>
where
  • <workspace-name> is the name of the workspace where the projection is to be placed
  • <path-to-federated-node> is a regular absolute path to the existing internal node under which the external nodes are to appear
  • <path-in-external-source-of-node is a regular absolute path in the virtual tree created by the connector of the node whose children are to appear as children of the federated node.
Projections can be defined within a repository's configuration (making them available immediately upon startup of the repository) or programmatically added or removed by client applications using the FederationManager interface.
The hierarchical database public API includes the org.modeshape.jcr.api.federation.FederationManager interface that defines several methods for programmatically creating and removing projections. Note that at this time it is not possible to programmatically create, modify, or remove external sources, so these must be defined within the repository configuration.
As clients navigate the nodes in the repository, they typically ask for one (or multiple) children of a particular node. Clients repeat this process until they access the node(s) they're looking for.
The hierarchical database performs these operations differently depending upon the kind of node:
  • If the parent is an internal node , then all children will also be internal. Therefore, to find a particular child by name, the hierarchical database obtains the parent's child reference to obtain the child's node key, and then looks up the node with that key in the Infinispan cache. This is the "conventional" behavior, and this incurs no overhead even when the repository is configured to use federation.
  • If the parent is a federated node , then the process is very similar to internal nodes, except that the internal and external child references are managed separately. The hierarchical database then looks at the child's node key to determine (from the key itself) if the child exists in the Infinispan cache or in an external source. If in an external source, the hierarchical database then calls to the connector to ask for the representation of the requested node.
  • If the parent is an external node , then the hierarchical database obtains the parent's child reference and looks up the node with that key in the same connector. The connector then generates a representation of the requested node.
All nodes (both internal and external) can be accessed by Session.getNodeByIdentifier(String , where the identifier is the same string returned by calling the getIdentifier() method on the node. The hierarchical database can tell from the identifier whether it is for an external node, and if so it will look up the node in the connector.

Note

Per the JCR specification, clients should treat these identifiers as opaque. In fact, hierarchical database identifiers follow a fairly complex pattern that will likely be difficult to reverse engineer, and which may change at any time.
The hierarchical database actually uses an in-memory LIRS cache of the nodes. So, although the navigation and lookup steps mentioned above do not discuss using the LIRS cache, the hierarchical database always consults this cache when it needs to find a node with a particular node key. If found in the cache, the node will be used. If the cache does not contain the node, then it will consult the Infinispan cache or the connector to obtain (and cache) the node.
Normally, nodes in the LIRS cache are evicted after a certain (but configurable) time. However, external nodes can have an additional internal property that specifies the maximum time that the node can be in the cache. Or, an external source can be configured with a global time to live value. Either way, the LIRS cache ensures that the nodes are evicted at the appropriate time.
Of course, a node is also evicted from the cache if the node has been changed and persisted (e.g., via Session.save() or user transaction commit), even if that change was made on a different process in the cluster.
A connector decides which external nodes are to be indexes.
  • The connector instance can be configured with a queryable boolean parameter that states whether any of the content is to be queryable. This defaults to true .
  • The connector can mark any or all nodes as not queryable .
Thus, even though a connector implementation may be written such that some or all of the external nodes can be queried, a repository configuration can configure an instance of that connector and override the behavior so that no nodes are queryable.

Note

If a connector is implemented by marking all nodes as not queryable , then configuring an instance of that connector with queryable=true has no effect.
Any nodes that are queryable will be included in the index, as long as the hierarchical database is notified of new nodes. By default, external nodes are not automatically indexed. To index them, use the public API for reindexing.
Once indexed, the nodes can be queried like any other nodes.
A connector works by creating a node representation of external data, and that node contains the references to the node's children. These references are relatively small (the ID and name of the child), and for many connectors this is sufficient and fast enough. However, when the number of children under a node starts to increase, building the list of child references for a parent node can become noticeable and even burdensome, especially when few (if any) of the child references may ultimately be resolved into nodes because no client actually uses those references.
A pageable connector is one that will expose the children of nodes in a "page by page" fashion, where the parent node only contains the first page of child references and subsequent pages are loaded only if needed. This turns out to be quite effective, since when clients navigate a specific path (or ask for a specific child of a parent by its name) the hierarchical database does not need to use the child references in a node's document and can instead have the connector resolve such (relative or absolute external) paths into an identifier and then ask for the document with that ID.
Therefore, the only time the child references are needed are when clients iterate over the children of a node. A pageable connector will only be asked for as many pages as needed to handle the client's iteration, making it very efficient for exposing a node structure that can contain nodes with numerous children.