8.2. Installing the Hadoop FileSystem Plugin for Red Hat Storage

8.2.1. Adding the Hadoop Installer for Red Hat Storage

To use the Hadoop feature on Red Hat Storage, you must have the big-data channel added and the Hadoop components installed on all the servers. Run the following command on the Ambari Management Server, the YARN Master Server, and all the servers within the Red Hat Storage trusted storage pool:
# yum install rhs-hadoop rhs-hadoop-install
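To verify that the packages are present on a given server, you can query the RPM database (an optional sanity check, not part of the documented procedure):
# rpm -q rhs-hadoop rhs-hadoop-install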

8.2.2. Configuring the Trusted Storage Pool for use with Hadoop

Red Hat Storage provides a series of utility scripts that allow you to quickly prepare Red Hat Storage for use with Hadoop and to install the Ambari Management Server. You must first run the initial Hadoop cluster configuration script, which installs the Ambari Management Server, prepares the YARN Master Server to host the Resource Manager and Job History Server services for Red Hat Storage, and builds a trusted storage pool if one does not exist.

Note

You must run the script given below irrespective of whether you have an existing Red Hat Storage trusted storage pool or not.
To run the initial Hadoop cluster configuration script:
  1. Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
  2. Run the Hadoop cluster configuration script as given below:
    setup_cluster.sh [-y] [--hadoop-mgmt-node <node>] [--yarn-master <node>] [--profile <profile>] [--ambari-repo <url>]  <node-list-spec>
    where <node-list-spec> is
    <node1>:<brickmnt1>:<blkdev1>  <node2>[:<brickmnt2>][:<blkdev2>]  [<node3>[:<brickmnt3>][:<blkdev3>]] ... [<nodeN>[:<brickmntN>][:<blkdevN>]]
    where
    • <brickmnt> is the name of the XFS mount for the above <blkdev>, for example, /mnt/brick1 or /external/HadoopBrick. When a Red Hat Storage volume is created, its bricks have the volume name appended, so <brickmnt> is a prefix for the volume's bricks. For example, if a new volume is named HadoopVol, then its brick list would be: <node>:/mnt/brick1/HadoopVol or <node>:/external/HadoopBrick/HadoopVol.
    • <blkdev> is the name of a Logical Volume device path, for example, /dev/VG1/LV1 or /dev/mapper/VG1-LV1. Since LVM is a prerequisite for Red Hat Storage, the <blkdev> is not expected to be a raw block path, such as /dev/sdb.
    Given below is an example of running the setup_cluster.sh script on the Ambari Management Server and four Red Hat Storage nodes, all of which use the same logical volume and mount point for their Red Hat Storage brick.
     ./setup_cluster.sh --yarn-master yarn.hdp rhs-1.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1 rhs-2.hdp rhs-3.hdp rhs-4.hdp

    Note

    If a brick mount is omitted, the brick mount of the first node is used, and if a block device is omitted, the block device of the first node is used.
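    For illustration, because of these defaults, the earlier example is equivalent to spelling out the same brick mount and block device for every node:
     ./setup_cluster.sh --yarn-master yarn.hdp rhs-1.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1 rhs-2.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1 rhs-3.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1 rhs-4.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1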

8.2.3. Creating Volumes for use with Hadoop

Note

If an existing Red Hat Storage volume is used with Hadoop, skip this section and continue with the instructions in the next section.
Whether you have a new or an existing Red Hat Storage trusted storage pool, a volume intended for use with Hadoop must be created in a way that supports Hadoop workloads. The supported volume configuration for Hadoop is a Distributed Replicated volume with a replica count of 2 or 3. You must not name the Hadoop-enabled Red Hat Storage volume hadoop or mapredlocal.
Run the script given below to create new volumes that you intend to use with Hadoop. The script sets the necessary configuration parameters on the volume and updates the Hadoop configuration to make the volume accessible to Hadoop.
  1. Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
  2. Run the volume creation script as given below:
    create_vol.sh [-y] <volName> [--replica count] <volMountPrefix> <node-list>
    where
    • --replica count specifies the replica count, which can be 2 or 3; the default is 2. The number of bricks must be a multiple of the replica count. The order in which bricks are specified determines how they are mirrored with each other: the first n bricks form a replica set, where n is the replica count.
    • <node-list> is: <node1>:<brickmnt1> <node2>[:<brickmnt2>] <node3>[:<brickmnt3>] ... [<nodeN>[:<brickmntN>]]
    • <brickmnt> is the name of the XFS mount for the block devices used by the above nodes, for example, /mnt/brick1 or /external/HadoopBrick. When a Red Hat Storage volume is created its bricks will have the volume name appended, so <brickmnt> is a prefix for the volume's bricks. For example, if a new volume is named HadoopVol then its brick list would be: <node>:/mnt/brick1/HadoopVol or <node>:/external/HadoopBrick/HadoopVol.

    Note

    The node-list for create_vol.sh is similar to the node-list-spec used by setup_cluster.sh except that a block device is not specified in create_vol.
    Given below is an example of creating a volume named HadoopVol using four Red Hat Storage servers, each with the same brick mount, and mounting the volume on /mnt/glusterfs:
    ./create_vol.sh HadoopVol /mnt/glusterfs rhs-1.hdp:/mnt/brick1 rhs-2.hdp rhs-3.hdp rhs-4.hdp
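    Following the brick-naming convention described above, the bricks of HadoopVol would be rhs-1.hdp:/mnt/brick1/HadoopVol through rhs-4.hdp:/mnt/brick1/HadoopVol, paired into replica sets of 2 in the order listed. To confirm the resulting layout (an optional check), you can inspect the volume from any node in the trusted storage pool:
    # gluster volume info HadoopVol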

8.2.4. Deploying and Configuring the HDP 2.1 Stack on Red Hat Storage using Ambari Manager

Prerequisite

Before deploying and configuring the HDP stack, perform the following steps:

  1. Open the terminal window of the server designated to be the Ambari Management Server and replace the HDP 2.1.GlusterFS repoinfo.xml file with the HDP 2.1 repoinfo.xml file:
    cp /var/lib/ambari-server/resources/stacks/HDP/2.1/repos/repoinfo.xml /var/lib/ambari-server/resources/stacks/HDP/2.1.GlusterFS/repos/
    You will be prompted to overwrite the /2.1.GlusterFS/repos/repoinfo.xml file; type yes to overwrite it.
  2. Restart the Ambari Server.
    # ambari-server restart
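    If you want to confirm that the server came back up (an optional check, not part of the documented procedure), ambari-server can report its own state:
    # ambari-server status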
Perform the following steps to deploy and configure the HDP stack on Red Hat Storage:

Important

This section describes how to deploy HDP on Red Hat Storage. Selecting HDFS as the storage selection in the HDP 2.1.GlusterFS stack is not supported. If you want to deploy HDFS, you must select the HDP 2.1 stack (not HDP 2.1.GlusterFS) and follow the instructions in the Hortonworks documentation.
  1. Launch a web browser and enter http://hostname:8080 in the URL bar, replacing hostname with the hostname of your Ambari Management Server.

    Note

    If the Ambari Console fails to load in the browser, it is usually because iptables is still running. Stop iptables by opening a terminal window and running the service iptables stop command.
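    For example (the chkconfig line is an optional extra that keeps iptables disabled across reboots; it is not part of the documented procedure):
    # service iptables stop
    # chkconfig iptables off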
  2. Enter admin and admin for the username and password.
  3. Assign a name to your cluster, such as MyCluster.
  4. Select the HDP 2.1.GlusterFS stack (if it is not already selected by default) and click Next.
  5. On the Install Options screen:
    1. For Target Hosts, add the YARN server and all the nodes in the trusted storage pool.
    2. Select Provide your SSH Private Key to automatically register hosts and provide the Ambari Server private key that was used to set up passwordless SSH across the cluster.
    3. Click the Register and Confirm button. It may take a while for this process to complete.
  6. For Confirm Hosts, it may take a while for all the hosts to be confirmed.
    1. After this process is complete, you can ignore any warnings from the Host Check related to File and Folder Issues, Package Issues, and User Issues, as these relate to customizations required for Red Hat Storage.
    2. Click Next and ignore the Confirmation Warning.
  7. For Choose Services, unselect HDFS and, at a minimum, select GlusterFS, Ganglia, YARN+MapReduce2, ZooKeeper, and Tez.

    Note

    • The use of Storm and Falcon has not been extensively tested, and they are not yet supported.
    • Do not select the Nagios service, as it is not supported. For more information, see subsection 21.1. Deployment Scenarios of chapter 21. Administering the Hortonworks Data Platform on Red Hat Storage in the Red Hat Storage 3.0 Administration Guide.
  8. For Assign Masters, set all the services to your designated YARN Master Server.
    1. For ZooKeeper, select your YARN Master Server and at least 2 additional servers within your cluster.
    2. Click Next to proceed.
  9. For Assign Slaves and Clients, select all the nodes as NodeManagers except the YARN Master Server.
    1. Select the Client checkbox for each chosen node.
    2. Click Next to proceed.
  10. On the Customize Services screen:
    1. Click the YARN tab, scroll down to the yarn.nodemanager.log-dirs and yarn.nodemanager.local-dirs properties, and remove any entries that begin with /mnt/glusterfs/.

      Important

      New Red Hat Storage and Hadoop clusters use the naming convention /mnt/glusterfs/volname as the mount point for Red Hat Storage volumes. If you have existing Red Hat Storage volumes that were created with different mount points, remove the entries for those mount points.
    2. Update the following property on the YARN tab, in the Application Timeline Server section:
      Key: yarn.timeline-service.leveldb-timeline-store.path
      Value: /tmp/hadoop/yarn/timeline
    3. Review other tabs that are highlighted in red. These require you to enter additional information, such as passwords for the respective services.
  11. On the Review screen, review your configuration and then click the Deploy button.
  12. On the Summary screen, click the Complete button and ignore any warnings and the Starting Services failed statement. This is normal, as some additional configuration is still required before the services can be started.
  13. Click Next to proceed to the Ambari Dashboard. Select the YARN service on the top left and click Stop-All. Do not click Start-All until you perform the steps in Section 8.5, “Verifying the Configuration”.

8.2.5. Enabling Existing Volumes for use with Hadoop

Important

This section is mandatory for every volume you intend to use with Hadoop. It is not sufficient to run the create_vol.sh script; you must follow the steps listed in this section as well.
If you have a volume that you would like to analyze with Hadoop and the volume was created by the create_vol.sh script above, it must still be enabled to support Hadoop workloads. Execute the enable_vol.sh script below to validate the volume's setup and to update Hadoop's core-site.xml configuration file, which makes the volume accessible to Hadoop.
If you have a volume that was not created by the above create_vol.sh script, it is important to ensure that both the bricks and the volumes that you intend to use are properly mounted and configured. If they are not, the enable_vol.sh script will throw an exception and not run. Perform the following steps to mount and configure bricks and volumes with required parameters on all storage servers:
  1. Each brick needs to be an XFS-formatted logical volume mounted with the noatime and inode64 parameters. For example, assuming the logical volume path is /dev/rhs_vg1/rhs_lv1 and that path is mounted on /mnt/brick1, the /etc/fstab entry for the mount point should look as follows:
    /dev/rhs_vg1/rhs_lv1    /mnt/brick1  xfs   noatime,inode64   0 0
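    If the logical volume has not yet been formatted and mounted, a typical sequence is sketched below (the -i size=512 inode size is a common Red Hat Storage recommendation for bricks; verify it against the guidance for your version):
    # mkfs.xfs -i size=512 /dev/rhs_vg1/rhs_lv1
    # mkdir -p /mnt/brick1
    # mount /mnt/brick1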
  2. Volumes must be mounted with the entry-timeout=0,attribute-timeout=0,use-readdirp=no,_netdev settings. Assuming your volume name is HadoopVol, the server's FQDN is rhs-1.hdp and your intended mount point for the volume is /mnt/glusterfs/HadoopVol then the /etc/fstab entry for the mount point of the volume must be as follows:
    rhs-1.hdp:/HadoopVol /mnt/glusterfs/HadoopVol glusterfs entry-timeout=0,attribute-timeout=0,use-readdirp=no,_netdev 0 0
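    After adding the entry, you can create the mount point and mount the volume directly (a reboot or mount -a achieves the same result):
    # mkdir -p /mnt/glusterfs/HadoopVol
    # mount /mnt/glusterfs/HadoopVol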
    Volumes that are to be used with Hadoop also need to have specific volume level parameters set on them. In order to set these, shell into a node within the appropriate volume's trusted storage pool and run the following commands (the examples assume the volume name is HadoopVol):
      # gluster volume set HadoopVol  performance.stat-prefetch off
      # gluster volume set HadoopVol  cluster.eager-lock on 
      # gluster volume set HadoopVol  performance.quick-read off
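    To confirm that the options took effect (an optional check), review the Options Reconfigured section of the volume information:
      # gluster volume info HadoopVol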
    
  3. Perform the following to create several Hadoop directories on that volume:
    1. Open the terminal window of one of the Red Hat Storage nodes in the trusted storage pool and navigate to the /usr/share/rhs-hadoop-install directory.
    2. Run the bin/add_dirs.sh volume-mount-dir list-of-directories, where volume-mount-dir is the path name for the glusterfs-fuse mount of the volume you intend to enable for Hadoop (including the name of the volume) and list-of-directories is the list generated by running bin/gen_dirs.sh -d script. For example:
      # bin/add_dirs.sh /mnt/glusterfs/HadoopVol $(bin/gen_dirs.sh -d)
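      You can spot-check that the directories were created by listing the volume mount afterwards (optional):
      # ls -l /mnt/glusterfs/HadoopVol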
After completing these three steps, you are ready to run the enable_vol.sh script.
Red Hat Storage-Hadoop has the concept of a default volume, which is the volume used when input and/or output URIs are unqualified. Unqualified URIs are common in Hadoop jobs, so defining the default volume, which can be set by the enable_vol.sh script, is important. The default volume is the first volume appearing in the fs.glusterfs.volumes property in the /etc/hadoop/conf/core-site.xml configuration file. The enable_vol.sh script supports the --make-default option which, if specified, causes the supplied volume to be prepended to the above property and thus become the default volume. The default behavior of enable_vol.sh is to not make the target volume the default, meaning the volume name is appended, rather than prepended, to the above property value.
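For illustration, a hypothetical excerpt of the property in /etc/hadoop/conf/core-site.xml after enabling two volumes, assuming HadoopVol was enabled with --make-default and a second volume SalesVol (an invented name) was enabled without it, so that the default volume appears first (the comma-separated list form is assumed here):
  <property>
    <name>fs.glusterfs.volumes</name>
    <value>HadoopVol,SalesVol</value>
  </property>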
The --user and --pass options are required for the enable_vol.sh script to log in to the cluster's Ambari instance and reconfigure the Red Hat Storage volume related configuration.

Note

The supported volume configuration for Hadoop is a Distributed Replicated volume with a replica count of 2 or 3. Also, when you run the enable_vol.sh script for the first time, you must specify the --make-default option.
  1. Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
  2. Run the Hadoop trusted storage pool configuration script as given below:
    # enable_vol.sh [-y]  [--make-default] [--hadoop-mgmt-node node] [--user admin-user] [--pass admin-password] [--port mgmt-port-num] [--yarn-master yarn-node] [--rhs-node storage-node] volName
    For example:
    # enable_vol.sh --yarn-master yarn.hdp  --rhs-node rhs-1.hdp HadoopVol --make-default

    Note

    If the --yarn-master and/or --rhs-node options are omitted, the default of localhost (the node from which the script is being executed) is assumed.
    If this is the first time you are running the enable_vol.sh script, you will see the warning WARN: Cannot find configured default volume on node: rhs-1.hdp: "fs.glusterfs.volumes" property value is missing from /etc/hadoop/conf/core-site.xml. This is normal, and the system will proceed to set the volume you are enabling as the default volume. You will not see this message when subsequently enabling additional volumes.