Chapter 26. Administering the Hortonworks Data Platform on Red Hat Gluster Storage

Warning

Support for Hortonworks Data Platform (HDP) on Red Hat Gluster Storage integrated using the Hadoop Plug-In is deprecated as of Red Hat Gluster Storage 3.1 Update 2, and is unlikely to be supported in the next major release. Red Hat discourages further use of this plug-in for deployments where Red Hat Gluster Storage is directly used for holding analytics data for running in-place analytics. However, Red Hat Gluster Storage can be used as a general purpose repository for holding analytics data and as a companion store where the bulk of the data is stored and then moved to Hadoop clusters for analysis when necessary.
Red Hat Gluster Storage provides filesystem compatibility for Apache Hadoop and uses the standard file system APIs available in Hadoop to provide a new storage option for Hadoop deployments. Existing Hadoop Ecosystem applications can use Red Hat Gluster Storage seamlessly.

Important

The following features of Red Hat Gluster Storage are not supported with Hadoop:
  • Dispersed Volumes and Distributed Dispersed Volumes
  • Red Hat Enterprise Linux 7

Advantages

The following are the advantages of Hadoop Compatible Storage with Red Hat Gluster Storage:

  • Gives Hadoop file-based access to Red Hat Gluster Storage volumes while the volumes continue to support POSIX features such as NFS mounts, FUSE mounts, snapshots, and geo-replication.
  • Eliminates the need for a centralized metadata server (HDFS primary and redundant NameNodes) by replacing HDFS with Red Hat Gluster Storage.
  • Provides compatibility with MapReduce and Hadoop Ecosystem applications with no code rewrite required.
  • Provides a fault tolerant file system.
  • Allows co-location of compute and data and the ability to run Hadoop jobs across multiple namespaces using multiple Red Hat Gluster Storage volumes.

26.1. Deployment Scenarios

Before you begin, meet the prerequisites by establishing the basic infrastructure required to run Hadoop distributions on Red Hat Gluster Storage. For information on the prerequisites and the installation procedure, see the Deploying the Hortonworks Data Platform on Red Hat Gluster Storage chapter in the Red Hat Gluster Storage 3.1 Installation Guide.
The supported volume configuration for Hadoop is Distributed Replicated volume with replica count 2 or 3.
The following table provides an overview of the components of the integrated environment.

Table 26.1. Component Overview

Component                                   Description
Ambari                                      Management Console for the Hortonworks Data Platform
Red Hat Gluster Storage Console (Optional)  Management Console for Red Hat Gluster Storage
YARN Resource Manager                       Scheduler for the YARN Cluster
YARN Node Manager                           Worker for the YARN Cluster on a specific server
Job History Server                          This logs the history of submitted YARN Jobs
glusterd                                    This is the Red Hat Gluster Storage process on a given server

26.1.1. Red Hat Gluster Storage Trusted Storage Pool with Two Additional Servers

The recommended approach to deploying the Hortonworks Data Platform on Red Hat Gluster Storage is to add two servers to your trusted storage pool. One server acts as the Management Server and hosts the management components, such as Hortonworks Ambari and the optional Red Hat Gluster Storage Console. The other server acts as the YARN Master Server and hosts the YARN Resource Manager and Job History Server components. This design ensures that the YARN Master processes do not compete for resources with the YARN NodeManager processes. It also allows the Management Server to be multi-homed on both the Hadoop Network and the User Network, which is useful for giving users limited visibility into the cluster.

Figure 26.1. Recommended Deployment Topology for Large Clusters

26.1.2. Red Hat Gluster Storage Trusted Storage Pool with One Additional Server

If two servers are not available, you can install the YARN Master Server and the Management Server on a single server. This is also an option if you have a server with abundant CPU and memory available. Monitor utilization on the server carefully to ensure that sufficient resources are available to all the processes. If resources are over-utilized, move to the deployment topology for a large cluster, as explained in the previous section. Ambari supports relocating the YARN Resource Manager to another server after it is deployed. It is also possible to move Ambari to another server after it is installed.

Figure 26.2. Recommended Deployment Topology for Smaller Clusters

26.1.3. Red Hat Gluster Storage Trusted Storage Pool only

If no additional servers are available, you can condense the YARN Master Server and Management Server processes onto a server within the trusted storage pool. This option is recommended only in an evaluation environment with workloads that do not utilize the servers heavily. Monitor utilization on the server carefully to ensure that sufficient resources are available for all the processes. If resources are over-utilized, move to the deployment topology detailed in Section 26.1.1, “Red Hat Gluster Storage Trusted Storage Pool with Two Additional Servers”. Ambari supports relocating the YARN Resource Manager to another server after it is deployed. It is also possible to move Ambari to another server after it is installed.

Figure 26.3. Evaluation deployment topology using the minimum number of servers
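
The choice among the three topologies in Sections 26.1.1 to 26.1.3 depends on how many servers are available beyond the trusted storage pool. The following minimal sketch captures that decision. The function name and the returned strings are illustrative assumptions and are not part of any Red Hat or Hortonworks tool.

```python
def recommend_topology(additional_servers: int) -> str:
    """Map the number of servers available outside the trusted storage
    pool to the deployment scenario described in this chapter.

    Hypothetical helper for illustration only.
    """
    if additional_servers < 0:
        raise ValueError("additional_servers must not be negative")
    if additional_servers >= 2:
        # Recommended: Management Server and YARN Master Server on separate hosts.
        return "26.1.1: two additional servers (recommended)"
    if additional_servers == 1:
        # Both roles share one host; monitor its utilization.
        return "26.1.2: one additional server"
    # No extra hosts: roles run inside the storage pool, evaluation use only.
    return "26.1.3: trusted storage pool only (evaluation)"


if __name__ == "__main__":
    for count in (0, 1, 2):
        print(count, "->", recommend_topology(count))
```

The sketch deliberately ignores resource utilization, which this chapter says must still be monitored in the one-server and pool-only scenarios.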

26.1.4. Deploying Hadoop on an existing Red Hat Gluster Storage Trusted Storage Pool

If you have an existing Red Hat Gluster Storage Trusted Storage Pool, you need to procure two additional servers for the YARN Master Server and the Ambari Management Server, as depicted in the deployment topology detailed in Section 26.1.1, “Red Hat Gluster Storage Trusted Storage Pool with Two Additional Servers”. If the trusted storage pool has no existing volumes, follow the instructions in the installation guide to create volumes and enable them for Hadoop. If it has existing volumes, follow the instructions to enable them for Hadoop.
The supported volume configuration for Hadoop is Distributed Replicated volume with replica count 2 or 3.
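
When reviewing an existing volume, the configuration rule above can be expressed as a small check. The sketch below assumes that a Distributed Replicated volume is one whose brick count is a multiple of the replica count and that contains at least two replica sets. The function name is hypothetical, and the check does not query Gluster itself; you supply the brick and replica counts.

```python
SUPPORTED_REPLICA_COUNTS = (2, 3)  # replica counts supported for Hadoop


def is_supported_for_hadoop(brick_count: int, replica_count: int) -> bool:
    """Return True if the layout looks like a Distributed Replicated
    volume with replica count 2 or 3.

    Hypothetical helper for illustration only.
    """
    if replica_count not in SUPPORTED_REPLICA_COUNTS:
        return False
    # Bricks must divide evenly into replica sets.
    if brick_count <= 0 or brick_count % replica_count != 0:
        return False
    # Assumption: "distributed" means more than one replica set.
    return brick_count // replica_count >= 2


if __name__ == "__main__":
    print(is_supported_for_hadoop(4, 2))  # two replica sets of two bricks
    print(is_supported_for_hadoop(2, 2))  # single replica set, not distributed
```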

26.1.5. Deploying Hadoop on a New Red Hat Gluster Storage Trusted Storage Pool

If you do not have an existing Red Hat Gluster Storage Trusted Storage Pool, you must procure all the servers listed in the deployment topology detailed in Section 26.1.1, “Red Hat Gluster Storage Trusted Storage Pool with Two Additional Servers”. You must then follow the installation instructions in the Red Hat Gluster Storage 3.1 Installation Guide so that the setup_cluster.sh script can build the storage pool for you. The rest of the installation instructions explain how to create and enable volumes for use with Hadoop.
The supported volume configuration for Hadoop is Distributed Replicated volume with replica count 2 or 3.