
7.2. Installing the Hadoop FileSystem Plugin for Red Hat Gluster Storage

7.2.1. Adding the Hadoop Installer for Red Hat Gluster Storage

To use the Hadoop feature on Red Hat Gluster Storage, you must have the big-data channel added and the Hadoop components installed on all the servers. Run the following command on the Ambari Management Server, the YARN Master Server, and all the servers within the Red Hat Gluster Storage trusted storage pool:
# yum install rhs-hadoop rhs-hadoop-install
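If you prefer to run the installation from a single terminal, a simple SSH loop works. The following is a minimal sketch, assuming passwordless SSH as root and the illustrative hostnames ambari.hdp, yarn.hdp, and rhs-1.hdp through rhs-4.hdp; substitute your own servers:
    # Install the Hadoop components on every server in the cluster (hostnames are placeholders).
    for host in ambari.hdp yarn.hdp rhs-1.hdp rhs-2.hdp rhs-3.hdp rhs-4.hdp; do
        ssh root@"$host" "yum -y install rhs-hadoop rhs-hadoop-install"
    done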

7.2.2. Configuring the Trusted Storage Pool for use with Hadoop

Red Hat Gluster Storage provides a series of utility scripts that allow you to quickly prepare Red Hat Gluster Storage for use with Hadoop and to install the Ambari Management Server. You must first run the initial Hadoop cluster configuration script, which installs the Ambari Management Server, prepares the YARN Master Server to host the Resource Manager and Job History Server services for Red Hat Gluster Storage, and builds a trusted storage pool if one does not exist.

Note

You must run the script given below regardless of whether you have an existing Red Hat Gluster Storage trusted storage pool.
To run the initial Hadoop cluster configuration script:
  1. Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
  2. Run the Hadoop cluster configuration script as given below:
    setup_cluster.sh [-y]  [--quiet | --verbose | --debug] [--force-ambari-update] [--hadoop-mgmt-node <node>] [--yarn-master <node>] [--profile <profile>] [--ambari-repo <url>]  <node-list-spec>
    where <node-list-spec> is
    <node1>:<brickmnt1>:<blkdev1>  <node2>[:<brickmnt2>][:<blkdev2>]  [<node3>[:<brickmnt3>][:<blkdev3>]] ... [<nodeN>[:<brickmntN>][:<blkdevN>]]
    where
    • <brickmnt> is the name of the XFS mount for the above <blkdev>, for example, /mnt/brick1 or /external/HadoopBrick. When a Red Hat Gluster Storage volume is created, its bricks have the volume name appended, so <brickmnt> is a prefix for the volume's bricks. For example, if a new volume is named HadoopVol, then its brick list would be: <node>:/mnt/brick1/HadoopVol or <node>:/external/HadoopBrick/HadoopVol. Each <brickmnt> is mounted with the noatime and inode64 mount options.
    • <blkdev> is the name of a Logical Volume device path, for example, /dev/VG1/LV1 or /dev/mapper/VG1-LV1. Since LVM is a prerequisite for Red Hat Gluster Storage, the <blkdev> is not expected to be a raw block path, such as /dev/sdb. A sketch of creating such a device appears at the end of this section.

      Note

      A minimum of two nodes, one brick mount, and one block device are required. A node can be repeated in <node-list>. For example, if host-1 and host-2 each have two different brick mounts and block devices, then <node-list> could look like: host-1:/mnt/brick1:/dev/vg1/lv1 host-2 host-1:/mnt/brick2:/dev/vg1/lv2 host-2:/mnt/brick2:/dev/vg1/lv2.
    • -y causes all prompts to be auto-answered "yes". The default is that the user must respond to each prompt.
    • --quiet is the default and produces the least output from the script.
    • --verbose outputs more information about the steps taken by the script.
    • --debug is the greatest level of output and is the same as seen in the /var/log/rhs-hadoop-install.log log file.

      Note

      The /var/log/rhs-hadoop-install.log log file contains the --debug level of detailed information regardless of the verbose level chosen when running the script.
    • --profile is the server-global profile name to set via the tuned-adm command. For example, --profile rhs-high-throughput. If specified, profile is set on each storage node in the cluster and applies to all workloads on these storage nodes. Default is that no profile is set.
    • --ambari-repo names the URL to be used when updating the ambari agents and/or ambari server. The default is to use the URL hard-coded in the bin/gen_ambari_repo_url.sh script.
    • --force-ambari-update causes the ambari-agent and ambari-server to be re-installed and re-started on all nodes in the cluster, even if they are already running. The default is to install the ambari-agent on all nodes where it is not running, and to install the ambari-server if it is not running either. For a new cluster, the agents and the ambari-server are installed. When adding nodes to an existing cluster, the new nodes have the ambari-agent installed and started, and the existing nodes, by default, have neither the agent nor the ambari-server re-installed. When verifying an existing cluster, by default, the ambari-agent and ambari-server are not re-installed. However, if --force-ambari-update is specified, the ambari-agents and ambari-server are always installed or re-installed.
    • --hadoop-mgmt-node is the hostname of the ambari management server. Default is localhost.
    • --yarn-master is the hostname of the YARN resource manager server. Default is localhost.
    Given below is an example of running the setup_cluster.sh script on the Ambari Management Server and four Red Hat Gluster Storage nodes, each of which has the same logical volume and mount point intended to be used as a Red Hat Gluster Storage brick.
     ./setup_cluster.sh --yarn-master yarn.hdp rhs-1.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1 rhs-2.hdp rhs-3.hdp rhs-4.hdp

    Note

    If a brick mount is omitted, the brick mount of the first node is used and if one block device is omitted, the block device of the first node is used.
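For reference, the following is a minimal sketch of creating the logical volume that a node-list entry such as rhs-1.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1 refers to. The raw disk /dev/sdb is a hypothetical placeholder, and the volume group and logical volume names simply reuse the example names from this chapter:
    # LVM is a prerequisite: carve a logical volume out of a hypothetical raw disk /dev/sdb.
    pvcreate /dev/sdb
    vgcreate rhs_vg1 /dev/sdb
    lvcreate -l 100%FREE -n rhs_lv1 rhs_vg1
    # The resulting <blkdev> is /dev/rhs_vg1/rhs_lv1; paired with the <brickmnt> /mnt/brick1,
    # it ends up XFS-formatted and mounted with the noatime,inode64 options described above.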

7.2.3. Creating Volumes for use with Hadoop

Note

If an existing Red Hat Gluster Storage volume is used with Hadoop, skip this section and continue with the instructions in the next section.
Whether you have a new or existing Red Hat Gluster Storage trusted storage pool, a volume for use with Hadoop must be created in a way that supports Hadoop workloads. The supported volume configuration for Hadoop is a Distributed Replicated volume with a replica count of 2 or 3. You must not name a Hadoop-enabled Red Hat Gluster Storage volume hadoop or mapredlocal.
Run the script given below to create new volumes that you intend to use with Hadoop. The script sets the necessary configuration parameters on the volume and updates the Hadoop configuration to make the volume accessible to Hadoop.
  1. Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
  2. Run the volume creation script as given below:
    create_vol.sh [-y][--quiet | --verbose | --debug] VOLNAME [--replica count] <volMountPrefix> <node-list>
    where
    • -y causes all prompts to be auto-answered "yes". The default is that the user must respond to each prompt.
    • --quiet is the default and produces the least output from the script.
    • --verbose outputs more information about the steps taken by the script.
    • --debug is the greatest level of output and is the same as seen in the /var/log/rhs-hadoop-install.log log file.

      Note

      The /var/log/rhs-hadoop-install.log log file contains the --debug level of detailed information regardless of the verbose level chosen when running the script.
    • --replica count is the replica count. You can specify a replica count of 2 or 3; the default is 2. The number of bricks must be a multiple of the replica count. The order in which bricks are specified determines how bricks are mirrored with each other: the first n bricks form one replica set, where n is the replica count (see the sketch at the end of this section).
    • <node-list> is: <node1>:<brickmnt1> <node2>[:<brickmnt2>] <node3>[:<brickmnt3>] ... [<nodeN>[:<brickmntN>]]
    • VOLNAME is the name of the new Red Hat Gluster Storage volume. By default, the performance.stat-prefetch=off, cluster.eager-lock=on, and performance.quick-read=off performance related options are set on the volume. The new volume will be mounted on all storage nodes, even nodes not directly spanned by the volume, and on the yarn-master node.
    • volMountPrefix is the name of the gluster-fuse mount path without the volume name. For example, /mnt/glusterfs or /distributed.
    • brickmnt is the name of the XFS mount for the block devices used by the above nodes, for example, /mnt/brick1 or /external/HadoopBrick. When a Red Hat Gluster Storage volume is created its bricks will have the volume name appended, so brickmnt is a prefix for the volume's bricks. For example, if a new volume is named HadoopVol then its brick list would be: <node>:/mnt/brick1/HadoopVol or <node>:/external/HadoopBrick/HadoopVol.

    Note

    The node-list for create_vol.sh is similar to the node-list-spec used by setup_cluster.sh except that a block device is not specified in create_vol.
    Given below is an example of creating a volume named HadoopVol using four Red Hat Gluster Storage servers, each with the same brick mount, and mounting the volume under /mnt/glusterfs:
    ./create_vol.sh HadoopVol /mnt/glusterfs rhs-1.hdp:/mnt/brick1 rhs-2.hdp rhs-3.hdp rhs-4.hdp
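For reference, the following sketch approximates the Gluster operations behind the example above. It is illustrative only; always use create_vol.sh rather than these commands. With the default replica count of 2, rhs-1.hdp/rhs-2.hdp and rhs-3.hdp/rhs-4.hdp form the two replica pairs:
    # Illustration only -- create_vol.sh performs these steps (and more) for you.
    gluster volume create HadoopVol replica 2 \
        rhs-1.hdp:/mnt/brick1/HadoopVol rhs-2.hdp:/mnt/brick1/HadoopVol \
        rhs-3.hdp:/mnt/brick1/HadoopVol rhs-4.hdp:/mnt/brick1/HadoopVol
    gluster volume start HadoopVol
    # Performance options the script sets on the new volume:
    gluster volume set HadoopVol performance.stat-prefetch off
    gluster volume set HadoopVol cluster.eager-lock on
    gluster volume set HadoopVol performance.quick-read off
    # The script also mounts the volume on every storage node and on the YARN master:
    mkdir -p /mnt/glusterfs/HadoopVol
    mount -t glusterfs rhs-1.hdp:/HadoopVol /mnt/glusterfs/HadoopVol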

7.2.4. Deploying and Configuring the HDP 2.1 Stack on Red Hat Gluster Storage using Ambari Manager

Prerequisite

Before deploying and configuring the HDP stack, perform the following steps:

  1. Open the terminal window of the server designated to be the Ambari Management Server and replace the HDP 2.1.GlusterFS repoinfo.xml file with the HDP 2.1 repoinfo.xml file:
    cp /var/lib/ambari-server/resources/stacks/HDP/2.1/repos/repoinfo.xml /var/lib/ambari-server/resources/stacks/HDP/2.1.GlusterFS/repos/
    You will be prompted to overwrite the 2.1.GlusterFS/repos/repoinfo.xml file; type yes to overwrite it. A quick way to verify the copy is shown after these steps.
  2. Restart the Ambari Server.
    # ambari-server restart
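You can verify that the copy succeeded by confirming that the two repoinfo.xml files are now identical; a minimal sketch:
    # diff prints nothing and the message appears only if the files match.
    diff /var/lib/ambari-server/resources/stacks/HDP/2.1/repos/repoinfo.xml \
         /var/lib/ambari-server/resources/stacks/HDP/2.1.GlusterFS/repos/repoinfo.xml \
      && echo "repoinfo.xml files match"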
Perform the following steps to deploy and configure the HDP stack on Red Hat Gluster Storage:

Important

This section describes how to deploy HDP on Red Hat Gluster Storage. Selecting HDFS as the storage selection in the HDP 2.1.GlusterFS stack is not supported. If you want to deploy HDFS, then you must select the HDP 2.1 stack (not HDP 2.1.GlusterFS) and follow the instructions in the Hortonworks documentation.
Ensure that you select only the supported 2.1.GlusterFS stack. Other, unsupported *GlusterFS stacks might be available for selection.
  1. Launch a web browser and enter http://hostname:8080 in the address bar, replacing hostname with the hostname of your Ambari Management Server.

    Note

    If the Ambari Console fails to load in the browser, it is usually because iptables is still running. Stop iptables by opening a terminal window and running the service iptables stop command.
  2. Enter admin and admin for the username and password.
  3. Assign a name to your cluster, such as MyCluster.
  4. Select the HDP 2.1.GlusterFS stack (if it is not already selected by default) and click Next.
  5. On the Install Options screen:
    1. For Target Hosts, add the YARN server and all the nodes in the trusted storage pool.
    2. Select Provide your SSH Private Key to automatically register hosts and provide your Ambari Server private key that was used to set up passwordless-SSH across the cluster.
    3. Click the Register and Confirm button. It may take a while for this process to complete.
  6. For Confirm Hosts, it may take a while for all the hosts to be confirmed.
    1. After this process is complete, you can ignore any warnings from the Host Check related to File and Folder Issues, Package Issues, and User Issues, as these are related to customizations that are required for Red Hat Gluster Storage.
    2. Click Next and ignore the Confirmation Warning.
  7. For Choose Services, unselect HDFS and, at a minimum, select GlusterFS, Ganglia, YARN+MapReduce2, ZooKeeper, and Tez.

    Note

    • The use of Storm and Falcon has not been extensively tested and is not yet supported.
    • Do not select the Nagios service, as it is not supported. For more information, see subsection 21.1. Deployment Scenarios of chapter 21. Administering the Hortonworks Data Platform on Red Hat Gluster Storage in the Red Hat Gluster Storage 3.0 Administration Guide.
  8. For Assign Masters, set all the services to your designated YARN Master Server.
    1. For ZooKeeper, select your YARN Master Server and at least two additional servers within your cluster.
    2. Click Next to proceed.
  9. For Assign Slaves and Clients, select all the nodes as NodeManagers except the YARN Master Server.
    1. Select the Client checkbox for each selected node.
    2. Click Next to proceed.
  10. On the Customize Services screen:
    1. Click the YARN tab, scroll down to the yarn.nodemanager.log-dirs and yarn.nodemanager.local-dirs properties, and remove any entries that begin with /mnt/glusterfs/.

      Important

      New Red Hat Gluster Storage and Hadoop clusters use the naming convention of /mnt/glusterfs/volname as the mount point for Red Hat Gluster Storage volumes. If you have existing Red Hat Gluster Storage volumes that were created with different mount points, remove the entries for those mount points.
    2. Update the following property in the Application Timeline Server section of the YARN tab:
      Key: yarn.timeline-service.leveldb-timeline-store.path
      Value: /tmp/hadoop/yarn/timeline
    3. Review other tabs that are highlighted in red. These require you to enter additional information, such as passwords for the respective services.
  11. On the Review screen, review your configuration and then click the Deploy button.
  12. On the Summary screen, click the Complete button and ignore any warnings and the Starting Services failed statement. This is normal, as there is still some additional configuration required before you can start the services.
  13. Click Next to proceed to the Ambari Dashboard. Select the YARN service on the top left and click Stop-All. Do not click Start-All until you perform the steps in Section 7.5, “Verifying the Configuration”.

7.2.5. Enabling Existing Volumes for use with Hadoop

Important

This section is mandatory for every volume you intend to use with Hadoop. It is not sufficient to run the create_vol.sh script; you must also follow the steps listed in this section.
If you have a volume that you would like to analyze with Hadoop, and the volume was created by the above create_vol.sh script, then it must be enabled to support Hadoop workloads. Execute the enable_vol.sh script below to validate the volume's setup and to update Hadoop's core-site.xml configuration file which makes the volume accessible to Hadoop.
If you have a volume that was not created by the above create_vol.sh script, it is important to ensure that both the bricks and the volumes that you intend to use are properly mounted and configured. If they are not, the enable_vol.sh script will display and log volume configuration errors. Perform the following steps to mount and configure bricks and volumes with required parameters on all storage servers:
  1. Each brick must be an XFS-formatted logical volume mounted with the noatime and inode64 parameters. For example, if the logical volume path is /dev/rhs_vg1/rhs_lv1 and it is mounted on /mnt/brick1, then the /etc/fstab entry for the mount point should look as follows:
    /dev/rhs_vg1/rhs_lv1    /mnt/brick1  xfs   noatime,inode64   0 0
  2. Volumes must be mounted with the _netdev setting. Assuming your volume name is HadoopVol, the server's FQDN is rhs-1.hdp, and your intended mount point for the volume is /mnt/glusterfs/HadoopVol, the /etc/fstab entry for the volume's mount point must be as follows:
    rhs-1.hdp:/HadoopVol /mnt/glusterfs/HadoopVol glusterfs _netdev 0 0
    Volumes that are to be used with Hadoop also need to have specific volume level parameters set on them. In order to set these, shell into a node within the appropriate volume's trusted storage pool and run the following commands (the examples assume the volume name is HadoopVol):
    # gluster volume set HadoopVol  performance.stat-prefetch off
    # gluster volume set HadoopVol  cluster.eager-lock on 
    # gluster volume set HadoopVol  performance.quick-read off
  3. Perform the following to create several Hadoop directories on that volume:
    1. Open the terminal window of one of the Red Hat Gluster Storage nodes in the trusted storage pool and navigate to the /usr/share/rhs-hadoop-install directory.
    2. Run bin/add_dirs.sh <volume-mount-dir> <list-of-directories>, where <volume-mount-dir> is the path name of the glusterfs-fuse mount of the volume you intend to enable for Hadoop (including the name of the volume) and <list-of-directories> is the list generated by running the bin/gen_dirs.sh -d script. For example:
      # bin/add_dirs.sh /mnt/glusterfs/HadoopVol $(bin/gen_dirs.sh -d)
After completing these three steps, you are ready to run the enable_vol.sh script.
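Before running enable_vol.sh, it can help to spot-check the brick mount, the volume mount, and the volume options; a minimal sketch using the example names above:
    # Brick mount should show xfs with the noatime,inode64 options.
    mount | grep /mnt/brick1
    # Volume should be mounted via glusterfs fuse.
    mount | grep /mnt/glusterfs/HadoopVol
    # The three reconfigured options should appear in the volume's option list.
    gluster volume info HadoopVol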
Red Hat Gluster Storage-Hadoop has the concept of a default volume, which is the volume used when input and/or output URIs are unqualified. Unqualified URIs are common in Hadoop jobs, so defining the default volume, which can be set by the enable_vol.sh script, is important. The default volume is the first volume appearing in the fs.glusterfs.volume property in the /etc/hadoop/conf/core-site.xml configuration file. The enable_vol.sh script supports the --make-default option which, if specified, causes the supplied volume to be prepended to the above property and thus become the default volume. The default behavior for enable_vol.sh is to not make the target volume the default volume, meaning the volume name is appended, rather than prepended, to the above property value.
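To check which volume is currently the default, you can read the property straight out of core-site.xml. The following is a minimal sketch; the exact formatting of the property value is an assumption here, shown as a hypothetical comma-separated list with HadoopVol first:
    # Print the fs.glusterfs.volume property and the line after it (the value).
    grep -A 1 'fs.glusterfs.volume' /etc/hadoop/conf/core-site.xml
    # Hypothetical output -- the first volume listed is the default:
    #   <name>fs.glusterfs.volume</name>
    #   <value>HadoopVol,OlderVol</value>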
The --user and --pass options are required for the enable_vol.sh script to log in to the cluster's Ambari instance and reconfigure the Red Hat Gluster Storage volume-related configuration.

Note

The supported volume configuration for Hadoop is a Distributed Replicated volume with a replica count of 2 or 3. Also, when you run the enable_vol.sh script for the first time, you must specify the --make-default option.
  1. Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
  2. Run the Hadoop Trusted Storage pool configuration script as given below:
    # enable_vol.sh [-y] [--quiet | --verbose | --debug]  [--make-default] [--hadoop-mgmt-node node] [--yarn-master yarn-node][--rhs-node storage-node] [--user ambari-admin-user] [--pass admin-password] VOLNAME
    For example:
    # enable_vol.sh --yarn-master yarn.hdp  --rhs-node rhs-1.hdp HadoopVol --make-default
    • VOLNAME is the name of the Red Hat Gluster Storage volume.
    • --yarn-master is the hostname of the YARN resource manager server. Default is localhost.
    • --rhs-node is the name of any of the existing Red Hat Gluster Storage nodes in the cluster. It is required unless this script is being run from a storage node. This value is necessary in order to run the gluster CLI.
    • --user and --pass are required to update the Hadoop configuration files (core-site.xml) residing on each node spanned by the volume. You must update the core-site.xml file for a volume to be visible to Hadoop jobs. If not specified, these options fall back to the Ambari defaults.
    • --make-default indicates that VOLNAME is to be made the default volume by pre-pending it to the core-site.xml volumes list property. The default behavior is to not alter the default volume name in core-site.xml file.
    • --quiet is the default and produces the least output from the script.
    • --verbose outputs more information about the steps taken by the script.
    • --debug is the greatest level of output and is the same as seen in the /var/log/rhs-hadoop-install.log log file.

      Note

      The /var/log/rhs-hadoop-install.log log file contains the --debug level of detailed information regardless of the verbose level chosen when running the script.
    • -y causes all prompts to be auto-answered "yes". The default is that the user must respond to each prompt.

    Note

    If the --yarn-master and/or --rhs-node options are omitted, then the default of localhost (the node from which the script is being executed) is assumed.