4.5.4. Batching Database Operations

Hibernate has long had the ability to batch database operations within the context of each transaction. The goal is to reduce the round-trip overhead between the application platform and the database server: if each and every statement is sent to the database individually, the sheer number of round trips adds significant latency. Batching of database operations can be used in any application that uses Hibernate as the persistence mechanism.
There are two Hibernate parameters that control the behavior of batching database operations:
  • hibernate.jdbc.fetch_size
  • hibernate.jdbc.batch_size
Both of these options set properties within the JDBC driver. In the first case, hibernate.jdbc.fetch_size sets the statement's fetch size within the JDBC driver, that is, the number of rows fetched per round trip when a select statement returns more than one row. In the second case, hibernate.jdbc.batch_size determines the number of updates (inserts, updates and deletes) that are sent to the database at one time for execution. This parameter is necessary for batch inserts, but it must be coupled with the hibernate.order_inserts parameter and a JDBC driver capable of rewriting the individual inserts into a single batch insert statement.

Warning

Ordering inserts and updates improves performance only if the JDBC driver can rewrite the statements, and it should not be enabled unless the driver has that capability. With MySQL Connector/J, for example, the rewriting is controlled by the rewriteBatchedStatements connection property.
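To make the two parameters concrete, below is a minimal plain-JDBC sketch of the driver-level mechanisms they map onto. The connection details, credentials and customer table are illustrative only, and the rewriteBatchedStatements flag in the URL is specific to MySQL Connector/J:

        import java.sql.Connection;
        import java.sql.DriverManager;
        import java.sql.PreparedStatement;
        import java.sql.ResultSet;
        import java.sql.SQLException;

        public class JdbcBatchingSketch {
            public static void main(String[] args) throws SQLException {
                try (Connection con = DriverManager.getConnection(
                        "jdbc:mysql://dbserver:3306/EJB3?rewriteBatchedStatements=true",
                        "user", "password")) {
                    // hibernate.jdbc.fetch_size maps onto Statement.setFetchSize():
                    // a hint for how many rows the driver retrieves per round trip.
                    try (PreparedStatement select =
                            con.prepareStatement("SELECT id, name FROM customer")) {
                        select.setFetchSize(20);
                        try (ResultSet rs = select.executeQuery()) {
                            while (rs.next()) {
                                // each rs.next() may trigger a fetch of up to 20 rows
                            }
                        }
                    }
                    // hibernate.jdbc.batch_size maps onto PreparedStatement.addBatch()
                    // and executeBatch(): many updates sent in a single round trip.
                    con.setAutoCommit(false);
                    try (PreparedStatement insert = con.prepareStatement(
                            "INSERT INTO customer (id, name) VALUES (?, ?)")) {
                        for (int i = 1; i <= 100; i++) {
                            insert.setInt(1, i);
                            insert.setString(2, "name-" + i);
                            insert.addBatch();
                            if (i % 20 == 0) {
                                insert.executeBatch(); // one trip for 20 inserts
                            }
                        }
                        insert.executeBatch(); // flush any remainder
                        con.commit();
                    }
                }
            }
        }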
Determining the correct values to be assigned to each of these parameters requires analysis of each application's behavior when transacting with its database. The following sections examine differences in performance of the test application with differing values for each parameter.
The first parameter examined is hibernate.jdbc.fetch_size. In the example application, there are queries that return a single row, queries that return 3 rows, and one query that can return up to 500 rows. In the last case, the query itself is limited through the API to 500 rows, although many more rows could satisfy it. It is also worth noting that this query is executed in a stateful session bean: when it runs, only 20 rows of the total matching rows are returned to the client at one time. If the client finds what it wants and does not page further through the result set, only those first 20 rows (or fewer, if the query matches fewer than 20) are ever used. Taking that into account, fetch_size values of 20 and 500 were tested. In these tests, 20 gave the best result, with 500 actually being slightly slower, and 20 was also faster than the default of 1: 20 provided 1.73% more throughput than 500, and 2.33% more throughput than 1. Aligning the value with the 20-row pagination, rather than with the query that can return upwards of 500 rows, was the most effective. Thorough analysis of the queries in the application, and of the frequency with which those queries are executed, is necessary to size this parameter correctly.
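The pagination described above follows the standard JPA pattern. As a rough sketch (the Customer entity, the query and the surrounding bean are hypothetical stand-ins for the example application's actual code):

        import java.util.List;
        import javax.persistence.EntityManager;
        import javax.persistence.Query;

        public class CustomerSearch {
            // Returns one 20-row page of the result set; only the pages the
            // client actually requests are ever materialized.
            @SuppressWarnings("unchecked")
            public List<Customer> findByName(EntityManager em, String name, int page) {
                Query query = em.createQuery(
                    "select c from Customer c where c.name like :name order by c.id");
                query.setParameter("name", name + "%");
                query.setFirstResult(page * 20); // skip the pages already shown
                query.setMaxResults(20);         // aligned with hibernate.jdbc.fetch_size
                return query.getResultList();
            }
        }

With hibernate.jdbc.fetch_size set to the same value of 20, each page can typically be satisfied by a single fetch from the driver.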
The application referred to throughout this book consists of two parts: a batch processing piece that creates reference data and an OLTP piece that processes business transactions using the reference data. With regard to these two parameters, the two parts have very distinct needs, so the challenge is how to meet the needs of both. One solution is to have multiple persistence units within a single application, which in turn means the application can have multiple entity managers, each with a different configuration. The norm is to place different entities in different persistence units, but having the same entities in different persistence units is not prohibited; what you cannot do is use the same entities in multiple persistence units, through their respective entity managers, at the same time. Because the batch processing part of the application has to run first to create the reference data, and the OLTP part runs second (and independently of the batch processing), two persistence units could be defined, with their respective entity managers and the same entities in both. That was highly convenient for the example application, but in the real world it may not be possible. Instead, you may have to divide batch processing and OLTP processing into different application deployments if they share the same entities. To illustrate this, here is the application's persistence.xml:
        <persistence xmlns="http://java.sun.com/xml/ns/persistence"
               xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
               xsi:schemaLocation="http://java.sun.com/xml/ns/persistence http://java.sun.com/xml/ns/persistence/persistence_1_0.xsd"
               version="1.0">
            <persistence-unit name="services" transaction-type="JTA">
                <provider>org.hibernate.ejb.HibernatePersistence</provider>
                <jta-data-source>java:/MySqlDS</jta-data-source>
                <properties>
                    <property name="hibernate.hbm2ddl.auto" value="none"/>
                    <property name="hibernate.default_catalog" value="EJB3"/>
                    <property name="hibernate.dialect" value="org.hibernate.dialect.MySQL5InnoDBDialect"/>
                    <property name="hibernate.order_inserts" value="true"/>
                    <property name="hibernate.default_batch_fetch_size" value="20"/>
                    <property name="hibernate.jdbc.fetch_size" value="20"/>
                    <property name="hibernate.jdbc.batch_size" value="20"/>
                    <property name="hibernate.cache.region.factory_class" value="org.hibernate.cache.jbc2.JndiMultiplexedJBossCacheRegionFactory"/>
                    <property name="hibernate.cache.region.jbc2.cachefactory" value="java:CacheManager"/>
                    <property name="hibernate.cache.use_second_level_cache" value="true"/>
                    <property name="hibernate.cache.use_query_cache" value="false"/>
                    <property name="hibernate.cache.use_minimal_puts" value="true"/>
                    <property name="hibernate.cache.region.jbc2.cfg.entity" value="mvcc-entity"/>
                    <property name="hibernate.cache.region_prefix" value="services"/>
                </properties>
            </persistence-unit>
            <persistence-unit name="batch-services" transaction-type="JTA">
                <provider>org.hibernate.ejb.HibernatePersistence</provider>
                <jta-data-source>java:/MySqlDS</jta-data-source>
                <properties>
                    <property name="hibernate.hbm2ddl.auto" value="none"/>
                    <property name="hibernate.default_catalog" value="EJB3"/>
                    <property name="hibernate.order_inserts" value="true"/>
                    <property name="hibernate.jdbc.fetch_size" value="50000"/>
                    <property name="hibernate.jdbc.batch_size" value="50000"/>
                    <property name="hibernate.dialect" value="org.hibernate.dialect.MySQL5InnoDBDialect"/>
                    <property name="hibernate.cache.use_second_level_cache" value="false"/>
                </properties>
            </persistence-unit>
        </persistence>
In this example, two persistence units are created: services and batch-services. The OLTP processing, "services", is configured with relatively small fetch and batch sizes, and an entity cache is defined for it. The batch processing, "batch-services", is configured with large fetch and batch sizes, and has no entity caching because it is unnecessary there. This is a very convenient and powerful feature of JPA.
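To use the two persistence units from code, each entity manager is injected by unit name. A minimal sketch, assuming a hypothetical session bean (the bean name is illustrative; the unit names match the persistence.xml above):

        import javax.ejb.Stateless;
        import javax.persistence.EntityManager;
        import javax.persistence.PersistenceContext;

        @Stateless
        public class ReferenceDataBean {
            // Entity manager backed by the OLTP-tuned persistence unit
            @PersistenceContext(unitName = "services")
            private EntityManager oltpEntityManager;

            // Entity manager backed by the batch-tuned persistence unit
            @PersistenceContext(unitName = "batch-services")
            private EntityManager batchEntityManager;

            // The batch part of the application would use batchEntityManager to
            // create the reference data; the OLTP part would use oltpEntityManager.
        }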
For the second parameter, hibernate.jdbc.batch_size, the Hibernate documentation recommends a value of between 5 and 30, but the right value depends upon the application's needs; the documented recommendation is suitable for most OLTP-like applications. By examining the insert and update queries within the application, you can determine how many rows are inserted and updated across the set. If inserts or updates never affect more than one row at a time, leaving the parameter unspecified is suitable. If inserts and/or updates affect many rows at a time within a single transaction, then setting this parameter is very important. Even if batch inserts are not required, and the aim is simply to send all the updates in as few trips to the database as possible, size it according to what is really happening in the application: if the largest transaction inserts and updates 20 rows, set the parameter to 20, or perhaps a little larger to accommodate any future changes.
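When batch_size is set, the bulk-insert pattern recommended by the Hibernate documentation is to flush and clear the persistence context at the same interval, so that the queued inserts go out in full batches and the session does not accumulate every inserted entity in memory. A sketch of that pattern, reusing the hypothetical Customer entity from the earlier example:

        import javax.persistence.EntityManager;

        public class ReferenceDataLoader {
            private static final int BATCH_SIZE = 20; // keep equal to hibernate.jdbc.batch_size

            public void load(EntityManager em, int rowCount) {
                for (int i = 1; i <= rowCount; i++) {
                    em.persist(new Customer((long) i, "name-" + i));
                    if (i % BATCH_SIZE == 0) {
                        em.flush(); // push the queued inserts as one JDBC batch
                        em.clear(); // detach the entities to keep memory use flat
                    }
                }
                // any remaining inserts are flushed when the transaction commits
            }
        }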