Chapter 2. Reference Architecture Environment

2.1. Apache Spark

Apache Spark is a fast and general-purpose cluster computing system. It provides high-level APIs in Java, Scala, Python and R, and an optimized engine that supports general execution graphs. It also supports a rich set of higher-level tools including Spark SQL for SQL and structured data processing, MLlib for machine learning, GraphX for graph processing, and Spark Streaming.

2.2. Red Hat JBoss Data Grid 7

2.2.1. Overview

Red Hat’s JBoss Data Grid 7 is an open source, distributed, in-memory key/value data store based on the Infinispan open source software project. Whether deployed in client/server mode or embedded in a Java Virtual Machine, it is built to be elastic, high-performing and highly available, and to scale linearly. Refer to the JDG 7 documentation for further details.

JBoss Data Grid is accessible to both Java and non-Java clients. With JBoss Data Grid, data is distributed and replicated across a manageable cluster of nodes, optionally written to disk, and easily accessible through the REST, Memcached and Hot Rod protocols, or directly in process through a traditional Java Map API.
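
As a rough illustration of what in-process access through a Java Map API looks like, the following sketch is written purely against java.util.concurrent.ConcurrentMap. It assumes the cache handle obtained from the data grid implements that interface, as the Cache interface of the underlying Infinispan project does. The class name, keys and values are hypothetical, and a ConcurrentHashMap stands in for the real cache so that the example runs without a data grid on the classpath.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

final class MapApiSketch {

    // Stand-in for a cache handle. With a real data grid, this object would be
    // obtained from the grid's cache manager; a ConcurrentHashMap is used here
    // only so the sketch is self-contained (assumption, not the JDG API).
    static ConcurrentMap<String, String> newStandInCache() {
        return new ConcurrentHashMap<>();
    }

    // Exercises the map-style operations an application would use in process.
    static String demo(ConcurrentMap<String, String> cache) {
        cache.put("user:1", "alice");              // store an entry
        cache.putIfAbsent("user:1", "bob");        // ignored: key already present
        cache.putIfAbsent("user:2", "carol");      // stored: key was absent
        cache.replace("user:2", "carol", "carla"); // conditional update
        cache.remove("user:1");                    // delete an entry
        return cache.get("user:2");                // read an entry back
    }

    public static void main(String[] args) {
        System.out.println(demo(newStandInCache()));
    }
}
```

Because the code depends only on the ConcurrentMap interface, the same calls would apply unchanged to any cache object that implements it.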

2.2.2. JBoss Data Grid Usage Modes

Red Hat JBoss Data Grid offers two usage modes:

  • Remote Client-Server mode
  • Library mode

Library Mode

Library mode allows building and deploying a custom runtime environment. The Library mode hosts a single data grid node in the application process, with remote access to nodes hosted in other JVMs. Refer to the JDG 7 documentation for further details.

Remote Client-Server Mode

Remote Client-Server mode provides a managed, distributed, and clusterable data grid server. In Client-Server mode, the server runs as a self-contained process, utilizing a container based on Red Hat JBoss Enterprise Application Platform (JBoss EAP), allowing client applications to remotely access the data grid server using Hot Rod, Memcached or REST client APIs. Refer to the JDG 7 documentation for further details.

2.2.3. Apache Spark Integration

Running JBoss Data Grid in a dedicated JVM allows it to be tuned and configured independently of Spark and client applications. In particular, handling the memory requirements of both Apache Spark and JBoss Data Grid in a single JVM can be difficult. For this and other reasons, Apache Spark integration support is only provided in Client-Server mode, and this reference architecture sets up JBoss Data Grid accordingly.

JDG 7.0 introduces Resilient Distributed Dataset (RDD) and Discretized Stream (DStream) integration with Apache Spark version 1.6.0. This enables you to use JDG as a highly scalable, high-performance data source for Apache Spark, executing Spark and Spark Streaming operations on data stored in JDG. Refer to the JDG 7 documentation for further details.
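
Since the integration runs against a remote data grid server, a Spark job has to be told where the server is and which cache to read. The sketch below shows one plausible way to assemble that configuration as a java.util.Properties object. The two property keys are written here as recalled from the Infinispan Spark connector documentation and should be verified against the JDG 7 documentation; the class name, host names and cache name are hypothetical. Creating the RDD itself requires the connector and Spark on the classpath and is not shown.

```java
import java.util.Properties;

final class SparkConnectorConfigSketch {

    // Builds the settings a Spark job would hand to the connector.
    // Assumption: the keys below match the connector version shipped with
    // JDG 7.0; check the product documentation before relying on them.
    static Properties connectorProperties(String serverList, String cacheName) {
        Properties props = new Properties();
        // Hot Rod endpoints of the data grid servers, separated by semicolons.
        props.setProperty("infinispan.client.hotrod.server_list", serverList);
        // Name of the cache that backs the RDD.
        props.setProperty("infinispan.rdd.cacheName", cacheName);
        return props;
    }

    public static void main(String[] args) {
        // 11222 is the default Hot Rod port; host and cache name are placeholders.
        Properties props = connectorProperties("127.0.0.1:11222", "default");
        props.forEach((k, v) -> System.out.println(k + "=" + v));
    }
}
```

Keeping the connection settings in one small factory method makes it straightforward to point the same Spark job at a different data grid cluster without touching the job logic.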