Chapter 7. Deploying the Hortonworks Data Platform 2.1 on Red Hat Gluster Storage

Warning

Support for Hortonworks Data Platform (HDP) on Red Hat Gluster Storage integrated using the Hadoop Plug-In is deprecated as of Red Hat Gluster Storage 3.1 Update 2, and is unlikely to be supported in the next major release. Red Hat discourages further use of this plug-in for deployments where Red Hat Gluster Storage is directly used for holding analytics data for running in-place analytics. However, Red Hat Gluster Storage can be used as a general purpose repository for holding analytics data and as a companion store where the bulk of the data is stored and then moved to Hadoop clusters for analysis when necessary.
Red Hat Gluster Storage is compatible with Apache Hadoop and uses the standard file system APIs available in Hadoop to provide a new storage option for Hadoop deployments. Red Hat has created a Hadoop File System plug-in that enables Hadoop distributions to run on Red Hat Gluster Storage.

Important

The following features of Red Hat Gluster Storage are not supported with Hadoop:
  • Dispersed Volumes and Distributed Dispersed Volumes
  • Red Hat Enterprise Linux 7.x

7.1. Prerequisites

Before you begin installation, you must establish the basic infrastructure required to enable Hadoop to run on Red Hat Gluster Storage.

7.1.1. Supported Versions

The following table lists the supported versions of HDP and Ambari with Red Hat Gluster Storage Server.

Table 7.1. Red Hat Gluster Storage Server Support Matrix

Red Hat Gluster Storage Server version    HDP version    Ambari version
3.1                                       2.1            1.6.1

7.1.2. Software and Hardware Requirements

You must ensure that all the servers used in this environment meet the following requirements:
  • Must have at least the following hardware specification:
    • 2 x 2 GHz 4 core processors
    • 32 GB RAM
    • 500 GB of storage capacity
    • 1 x 1 GbE NIC
  • Must have iptables disabled.
  • Must use fully qualified domain names (FQDN). For example, rhs-1.server.com is acceptable, but rhs-1 is not.
  • SELinux must be in disabled mode.
  • Time on all the servers must be uniform. It is recommended to set up an NTP (Network Time Protocol) service to keep the time synchronized.
  • Either all servers must be configured to use a DNS server and be able to resolve FQDNs through DNS, or all the storage nodes must list the FQDN of every server in the cluster in their /etc/hosts file.
  • Must have the following users and groups available on all the servers.
    User         Group
    yarn         hadoop
    mapred       hadoop
    hive         hadoop
    hcat         hadoop
    ambari-qa    hadoop
    hbase        hadoop
    tez          hadoop
    zookeeper    hadoop
    oozie        hadoop
    falcon       hadoop
    The specific UIDs and GIDs for these users and groups are up to the administrator of the trusted storage pool, but they must be consistent across the trusted storage pool. For example, if the mapred user has UID 591 on one server, the mapred user must have UID 591 on all other servers. Managing this with local authentication can be a lot of work, and it is common and acceptable to install a central authentication solution such as LDAP or Active Directory for your cluster so that users and groups can be managed in one place. However, if you use local authentication, you can run the following commands on each server to create the users and groups with consistent IDs across the cluster:
    groupadd -g 590 hadoop
    useradd -u 591 -g hadoop mapred
    useradd -u 592 -g hadoop yarn
    useradd -u 594 -g hadoop hcat
    useradd -u 595 -g hadoop hive
    useradd -u 590 -g hadoop ambari-qa
    useradd -u 593 -g hadoop tez
    useradd -u 596 -g hadoop oozie
    useradd -u 597 -g hadoop zookeeper
    useradd -u 598 -g hadoop falcon
    useradd -u 599 -g hadoop hbase
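As a rough consistency check for the local-authentication approach, the following Bash sketch compares the UIDs on a server against the values used in the commands above. The function name check_uids and the EXPECTED_UIDS variable are illustrative and are not part of any Red Hat tool; adjust the table if you chose different UIDs.

```shell
#!/bin/bash
# Expected user:UID pairs, taken from the useradd commands in this section.
# Change these if your trusted storage pool uses different UIDs.
EXPECTED_UIDS="mapred:591 yarn:592 tez:593 hcat:594 hive:595 ambari-qa:590 oozie:596 zookeeper:597 falcon:598 hbase:599"

# check_uids (illustrative helper)
# Reads passwd-format lines on stdin and prints one line for every user
# whose entry is missing or whose UID differs from EXPECTED_UIDS.
# Returns 0 when every user matches, 1 otherwise.
check_uids() {
    local passwd_data pair user want got status=0
    passwd_data=$(cat)
    for pair in $EXPECTED_UIDS; do
        user=${pair%%:*}
        want=${pair##*:}
        # Field 1 is the user name and field 3 is the UID in passwd format.
        got=$(awk -F: -v u="$user" '$1 == u { print $3; exit }' <<< "$passwd_data")
        if [ -z "$got" ]; then
            echo "MISSING $user (expected UID $want)"
            status=1
        elif [ "$got" != "$want" ]; then
            echo "MISMATCH $user: UID $got, expected $want"
            status=1
        fi
    done
    return $status
}
```

On each server, a call such as `getent passwd | check_uids` should print nothing and return 0 when the users are set up as expected.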

7.1.3. Existing Red Hat Gluster Storage Trusted Storage Pool

If you have an existing Red Hat Gluster Storage trusted storage pool, you need to add two additional servers to run the Hortonworks Ambari Management Services and the YARN Master Services, respectively. For more information on recommended deployment topologies, see the Administering the Hortonworks Data Platform on Red Hat Gluster Storage chapter in the Red Hat Gluster Storage Administration Guide.
In addition, all nodes within the Red Hat Gluster Storage Trusted Storage Pool that contain volumes that are to be used with Hadoop must contain a local glusterfs-fuse mount of that volume. The path of the mount for each volume must be consistent across the cluster.
For information on expanding your trusted storage pool by adding servers, see section Expanding Volumes in the Red Hat Gluster Storage 3.1 Administration Guide.
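The local mount requirement described above can be checked on each node with a small Bash sketch like the following. It assumes that glusterfs-fuse mounts are listed in /proc/mounts with the file system type fuse.glusterfs; the function name has_gluster_mount and the volume name HadoopVol in the usage note are hypothetical.

```shell
#!/bin/bash
# has_gluster_mount MOUNTPOINT [MOUNTS_FILE]   (illustrative helper)
# Succeeds when MOUNTPOINT appears in MOUNTS_FILE (default /proc/mounts)
# as a mount of type fuse.glusterfs.
has_gluster_mount() {
    local mountpoint=$1 mounts_file=${2:-/proc/mounts}
    # /proc/mounts fields: device, mount point, file system type, options...
    awk -v mp="$mountpoint" '
        $2 == mp && $3 == "fuse.glusterfs" { found = 1 }
        END { exit !found }' "$mounts_file"
}
```

For example, `has_gluster_mount /mnt/glusterfs/HadoopVol || echo "volume is not mounted"` could be run on every node to confirm that the mount path is the same across the cluster.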

Note

The supported volume configuration for Hadoop is Distributed Replicated volume with replica count of 2 or 3.
SELinux must be in disabled mode. The rhs-hadoop-install script does not recognize SELinux in permissive mode and requires SELinux to be disabled completely. This requires an additional restart of all storage machines.
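Because permissive mode is not sufficient, it can help to check both the running mode and the persistent setting before installation. The following Bash sketch assumes the standard getenforce command and the /etc/selinux/config file; the two function names are illustrative.

```shell
#!/bin/bash
# selinux_runtime_disabled MODE   (illustrative helper)
# MODE is the output of `getenforce`: Enforcing, Permissive, or Disabled.
# Succeeds only for Disabled, because permissive mode is not accepted.
selinux_runtime_disabled() {
    [ "$1" = "Disabled" ]
}

# selinux_config_disabled [FILE]   (illustrative helper)
# Succeeds when FILE (default /etc/selinux/config) contains the line
# SELINUX=disabled, which is the setting that takes effect after a restart.
selinux_config_disabled() {
    local config=${1:-/etc/selinux/config}
    grep -Eq '^[[:space:]]*SELINUX=disabled[[:space:]]*$' "$config"
}
```

A typical call would be `selinux_runtime_disabled "$(getenforce)" && selinux_config_disabled`. If only the configuration check passes, the server has probably not yet been restarted.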

Important

New Red Hat Gluster Storage and Hadoop clusters use the naming conventions of /mnt/brick1 as the mount point for Red Hat Gluster Storage bricks and /mnt/glusterfs/volname as the mount point for Red Hat Gluster Storage volumes. It is possible that you have an existing Red Hat Gluster Storage volume that has been created with different mount points for the Red Hat Gluster Storage bricks and volumes. If the mount points differ from the convention, replace the prefix listed in this installation guide with the prefix that you have.
Information on how to mount and configure bricks and volumes with the required parameters, and a description of the required local mount of the gluster volume, is available in Section 7.2.5, “Enabling Existing Volumes for use with Hadoop”.

7.1.4. New Red Hat Gluster Storage Trusted Storage Pool

You must create a Red Hat Gluster Storage trusted storage pool with at least four bricks for two-way replication or at least six bricks for three-way replication. The servers on which these bricks reside must have Red Hat Gluster Storage installed on them. The number of bricks must be a multiple of the replica count for a distributed replicated volume.
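The brick-count rules above amount to simple arithmetic: the replica count is 2 or 3, the number of bricks is at least twice the replica count, and the number of bricks is a multiple of the replica count. A minimal Bash sketch of that check follows; the function name valid_brick_count is illustrative.

```shell
#!/bin/bash
# valid_brick_count BRICKS REPLICA   (illustrative helper)
# Succeeds when BRICKS is acceptable for a distributed replicated volume
# with the given REPLICA count, according to the rules in this section.
valid_brick_count() {
    local bricks=$1 replica=$2
    # Only replica counts of 2 and 3 are supported with Hadoop.
    case "$replica" in
        2|3) ;;
        *) return 1 ;;
    esac
    # At least four bricks for replica 2 and six for replica 3,
    # and always a multiple of the replica count.
    [ "$bricks" -ge $((2 * replica)) ] && [ $((bricks % replica)) -eq 0 ]
}
```

For instance, `valid_brick_count 8 2` succeeds, while `valid_brick_count 5 2` and `valid_brick_count 3 3` fail.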
For more information on installing Red Hat Gluster Storage, see Chapter 2, Installing Red Hat Gluster Storage. For upgrading to Red Hat Gluster Storage 3.1, see Chapter 9, Upgrading from Red Hat Gluster Storage 2.1.x to Red Hat Gluster Storage 3.1.
Red Hat recommends that you have an additional two servers set aside to run the Hortonworks Ambari Management Services and the YARN Master Services, respectively. Alternate deployment topologies are also possible. For more information on the supported deployment topologies, see the Administering the Hortonworks Data Platform on Red Hat Gluster Storage chapter in the Red Hat Gluster Storage Administration Guide.
For information on expanding your trusted storage pool by adding servers, see section Expanding Volumes in the Red Hat Gluster Storage 3.1 Administration Guide.

Note

The supported volume configuration for Hadoop is Distributed Replicated volume with replica count of 2 or 3.

7.1.5. Red Hat Gluster Storage Server Requirements

You must install Red Hat Gluster Storage Server on the server. While installing the server, ensure that you specify a fully qualified domain name (FQDN). A hostname alone will not meet the requirements for the Hortonworks Data Platform Ambari deployment tool.
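One way to catch a short hostname early is a syntactic check like the Bash sketch below. It only tests that the name consists of at least two dot-separated labels made of letters, digits, and hyphens; it does not verify that the name resolves. The function name is_fqdn is illustrative.

```shell
#!/bin/bash
# is_fqdn NAME   (illustrative helper)
# Succeeds when NAME looks like a fully qualified domain name: two or more
# labels separated by dots, each starting and ending with a letter or digit.
is_fqdn() {
    local name=$1
    local label='[A-Za-z0-9]([A-Za-z0-9-]*[A-Za-z0-9])?'
    local re="^(${label}\.)+${label}$"
    [[ $name =~ $re ]]
}
```

For example, `is_fqdn rhs-1.server.com` succeeds and `is_fqdn rhs-1` fails. On a server, `is_fqdn "$(hostname)"` gives a quick indication of whether the configured hostname is fully qualified.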
You must also enable the rhs-big-data-3-for-rhel-6-server-rpms channel on this server.
  • If you have registered your machine using Red Hat Subscription Manager, enable the repository by running the following command:
    # subscription-manager repos --enable=rhs-big-data-3-for-rhel-6-server-rpms
  • If you have registered your machine using Satellite server, enable the channel by running the following command:
    # rhn-channel --add --channel rhel-x86_64-server-6-rhs-bigdata-3
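After enabling the repository with Subscription Manager, you may want to confirm that it is active. The Bash sketch below filters the output of `yum repolist enabled`; it assumes that the repository ID is the first column of each line, possibly prefixed with "!" or "*" or followed by a "/" suffix, which can vary between yum versions. The function name repo_enabled is illustrative.

```shell
#!/bin/bash
# repo_enabled REPO_ID   (illustrative helper)
# Reads `yum repolist enabled` output on stdin and succeeds when REPO_ID
# appears in the first column.
repo_enabled() {
    local repo_id=$1
    awk -v id="$repo_id" '
        {
            name = $1
            sub(/^[!*]/, "", name)   # strip status markers some versions print
            sub(/\/.*$/, "", name)   # strip any /release/arch style suffix
        }
        name == id { found = 1 }
        END { exit !found }'
}
```

A typical call would be `yum repolist enabled | repo_enabled rhs-big-data-3-for-rhel-6-server-rpms`. For servers registered through Satellite, `rhn-channel --list` shows the subscribed channels instead.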

7.1.6. Hortonworks Ambari Server Requirements

You must install Red Hat Enterprise Linux 6.6 on this server. Optionally, you can also install Red Hat Gluster Storage Console on this server, which allows all aspects of the Red Hat Gluster Storage trusted storage pool to be managed from a single server. While installing the server, ensure that you specify a fully qualified domain name (FQDN). A hostname alone will not meet the requirements for the Hortonworks Data Platform Ambari deployment tool. You must set up a passwordless SSH connection from the Ambari Server to all other servers within the trusted storage pool. Instructions for installing and configuring Hortonworks Ambari are provided in later sections of this chapter.
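The passwordless SSH requirement can be verified from the Ambari server with a sketch like the one below. It assumes that key-based login as root has already been set up, for example with ssh-keygen and ssh-copy-id; the use of the root account, the function name check_passwordless_ssh, and the SSH_CMD override are assumptions made for illustration.

```shell
#!/bin/bash
# check_passwordless_ssh HOST...   (illustrative helper)
# Tries a non-interactive SSH login to each HOST and prints OK or FAIL.
# BatchMode=yes makes ssh fail instead of prompting for a password.
# Returns 0 only when every host accepts the connection.
# SSH_CMD can be set to substitute a different ssh command.
check_passwordless_ssh() {
    local ssh_cmd=${SSH_CMD:-ssh} host failed=0
    for host in "$@"; do
        if "$ssh_cmd" -o BatchMode=yes -o ConnectTimeout=5 \
               "root@$host" true </dev/null; then
            echo "OK $host"
        else
            echo "FAIL $host"
            failed=1
        fi
    done
    return $failed
}
```

Run it with the FQDNs of all other servers in the trusted storage pool as arguments, for example `check_passwordless_ssh rhs-1.server.com`. Any FAIL line indicates a server that still prompts for a password or is unreachable.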
If the Hortonworks Ambari server is installed on a different node than Red Hat Gluster Storage Server, you must also enable the rhel-6-server-rh-common-rpms channel on this server.
  • If you have registered your machine using Red Hat Subscription Manager, enable the repository by running the following command:
    # subscription-manager repos --enable=rhel-6-server-rh-common-rpms
  • If you have registered your machine using Satellite server, enable the channel by running the following command:
    # rhn-channel --add --channel rhel-x86_64-server-rh-common-6

Warning

Red Hat Gluster Storage Console enables Nagios Alerting for Red Hat Gluster Storage. The Nagios Client libraries are shipped with Red Hat Gluster Storage and are on each Red Hat Gluster Storage Server. This causes a conflict with the Nagios System that is bundled with the Hortonworks Data Platform (HDP). Hence, using Ambari to deploy and manage HDP Nagios is not supported.

Note

If you are using one of the condensed deployment topologies listed in the Administration Guide and you have elected to place the Ambari Management server on the same node as a Red Hat Gluster Storage Server, you need to enable only the rhs-big-data-3-for-rhel-6-server-rpms channel on that server.
  • If you have registered your machine using Red Hat Subscription Manager, enable the repository by running the following command:
    # subscription-manager repos --enable=rhs-big-data-3-for-rhel-6-server-rpms
  • If you have registered your machine using Satellite server, enable the channel by running the following command:
    # rhn-channel --add --channel rhel-x86_64-server-6-rhs-bigdata-3

7.1.7. YARN Master Server Requirements

You must install Red Hat Enterprise Linux 6.6 on this server. While installing the server, ensure that you specify a fully qualified domain name (FQDN). A hostname alone will not meet the requirements for the Hortonworks Data Platform Ambari deployment tool.
If the YARN Master server is installed on a different node than Red Hat Gluster Storage Server, you must also enable the rhel-6-server-rh-common-rpms and rhel-6-server-rhs-client-1-rpms channels on the YARN server.
  • If you have registered your machine using Red Hat Subscription Manager, enable the repositories by running the following command:
    # subscription-manager repos --enable=rhel-6-server-rh-common-rpms --enable=rhel-6-server-rhs-client-1-rpms
  • If you have registered your machine using Satellite server, enable the channels by running the following commands:
    # rhn-channel --add --channel rhel-x86_64-server-rh-common-6
    # rhn-channel --add --channel rhel-x86_64-server-rhsclient-6

Note

If you are using one of the condensed deployment topologies listed in the Administration Guide and you have elected to place the YARN Master server on the same node as a Red Hat Gluster Storage Server, you need to enable only the rhs-big-data-3-for-rhel-6-server-rpms channel on that server.
  • If you have registered your machine using Red Hat Subscription Manager, enable the repository by running the following command:
    # subscription-manager repos --enable=rhs-big-data-3-for-rhel-6-server-rpms
  • If you have registered your machine using Satellite server, enable the channel by running the following command:
    # rhn-channel --add --channel rhel-x86_64-server-6-rhs-bigdata-3