Chapter 16. MapReduce

16.1. About MapReduce

The JBoss Data Grid MapReduce model is an adaptation of Google's MapReduce model.
MapReduce is a programming model used to process and generate large data sets. It is typically used in distributed computing environments where nodes are clustered. In JBoss Data Grid, MapReduce allows transparent distributed processing of very large amounts of data across the data grid by performing most computations as locally possible to where the data is stored.
MapReduce uses the two distinct computational phases of map and reduce to process information requests through the data grid. The process occurs as follows:
  1. The user initiates a task on a cache instance, which runs on a cluster node (the master node).
  2. The master node receives the task input, divides the task, and sends tasks for map phase execution on the grid.
  3. Each node executes a Mapper function on its input, and returns intermediate results back to the master node.
    • If the distributedReducePhase parameter is set to "true", the map results are inserted in an intermediary cache, rather than being returned to the master node.
    • If a Combiner has been specified with task.combinedWith(Reducer), the Combiner is called on the Mapper results and the combiner's results are retured to the master node or inserted in the intermediary cache.
  4. The master node collects all intermediate results from the map phase and merges all intermediate values associated with the same intermediate key.
    • If the distributedReducePhase parameter is set to "true", the merging of the intermediate values is done on each node, as the Mapper or Combiner results are inserted in the intermediary cache.The master node only receives the intermediate keys.
  5. The master node sends intermediate key/value pairs for reduction on the grid.
    • If the distributedReducePhase parameter is set to "false", the reduction phase is executed only on the master node.
  6. The final results of the reduction phase are returned.
    • If the distributedReducePhase parameter is set to "true", the master node running the task receives all results from the reduction phase and returns the final result to the MapReduce task initiator.
    • If a Collator has been specified with task.execute(Collator), the Collator is executed on the reduction phase results, and the Collator result is returned to the task initiator.