7.2. Installing the Hadoop FileSystem Plugin for Red Hat Gluster Storage
7.2.1. Adding the Hadoop Installer for Red Hat Gluster Storage
You must add the big-data channel and install the Hadoop components on all servers to use the Hadoop feature on Red Hat Gluster Storage. Run the following command on the Ambari Management Server, the YARN Master Server, and all servers in the Red Hat Gluster Storage trusted storage pool:
# yum install rhs-hadoop rhs-hadoop-install
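The install step above must run on every node in the cluster. As a minimal sketch (assuming passwordless SSH and hypothetical hostnames), the dry-run loop below only prints the per-node command; replace the echo with an ssh invocation to execute it for real:

```shell
#!/bin/sh
# Hypothetical node list: Ambari server, YARN master, and storage nodes.
# Substitute the hostnames of your own cluster.
NODES="ambari.hdp yarn.hdp rhs-1.hdp rhs-2.hdp rhs-3.hdp rhs-4.hdp"

for node in $NODES; do
  # Dry run: prints what would be executed on each node.
  # To run for real: ssh "$node" yum install -y rhs-hadoop rhs-hadoop-install
  echo "$node: yum install -y rhs-hadoop rhs-hadoop-install"
done
```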
7.2.2. Configuring the Trusted Storage Pool for use with Hadoop
Red Hat Gluster Storage provides a series of utility scripts that allow you to quickly prepare Red Hat Gluster Storage for use with Hadoop and to install the Ambari Management Server. You must first run the Hadoop cluster configuration initial script, which installs the Ambari Management Server, prepares the YARN Master Server to host the Resource Manager and Job History Server services for Red Hat Gluster Storage, and builds a trusted storage pool if one does not exist.
Note
You must run the script given below irrespective of whether you have an existing Red Hat Gluster Storage trusted storage pool or not.
To run the Hadoop configuration initial script:
- Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
- Run the hadoop cluster configuration script as given below:
setup_cluster.sh [-y] [--quiet | --verbose | --debug] [--force-ambari-update] [--hadoop-mgmt-node <node>] [--yarn-master <node>] [--profile <profile>] [--ambari-repo <url>] <node-list-spec>
where <node-list-spec> is:
<node1>:<brickmnt1>:<blkdev1> <node2>[:<brickmnt2>][:<blkdev2>] [<node3>[:<brickmnt3>][:<blkdev3>]] ... [<nodeN>[:<brickmntN>][:<blkdevN>]]
where:
- <brickmnt> is the name of the XFS mount for the above <blkdev>, for example, /mnt/brick1 or /external/HadoopBrick. When a Red Hat Gluster Storage volume is created, its bricks have the volume name appended, so <brickmnt> is a prefix for the volume's bricks. For example, if a new volume is named HadoopVol, then its brick list would be <node>:/mnt/brick1/HadoopVol or <node>:/external/HadoopBrick/HadoopVol. Each brick mount is mounted with the noatime and inode64 mount options.
- <blkdev> is the name of a Logical Volume device path, for example, /dev/VG1/LV1 or /dev/mapper/VG1-LV1. Since LVM is a prerequisite for Red Hat Gluster Storage, <blkdev> is not expected to be a raw block path, such as /dev/sdb.
Note
A minimum of two nodes, one brick mount, and one block device are required. A node can be repeated in <node-list>. For example, if host-1 has two different brick mounts and block devices, then <node-list> could look like: host-1:/mnt/brick1:/dev/vg1/lv1, host-2, host-1:/mnt/brick2:/dev/vg1/lv2, and host-2:/mnt/brick2:/dev/vg1/lv2.
- -y causes all prompts to be auto-answered "yes". The default is that the user must respond to each prompt.
- --quiet is the default and produces the least output from the script.
- --verbose outputs more information about the steps taken by the script.
- --debug is the greatest level of output and is the same as seen in the /var/log/rhs-hadoop-install.log log file.
Note
The /var/log/rhs-hadoop-install.log log file contains the --debug level of detailed information regardless of the verbose level chosen when running the script.
- --profile is the server-global profile name to set via the tuned-adm command, for example, --profile rhs-high-throughput. If specified, the profile is set on each storage node in the cluster and applies to all workloads on those storage nodes. The default is that no profile is set.
- --ambari-repo names the URL to be used when updating the ambari agents and/or ambari server. The default is to use the URL hard-coded in the bin/gen_ambari_repo_url.sh script.
- --force-ambari-update causes the ambari-agent and ambari-server to be re-installed and restarted on all nodes in the cluster, even if they are already running. The default is to install the ambari-agent on all nodes where it is not running, and to install the ambari-server if it too is not running. For a new cluster, the agents and the ambari-server are installed. When adding nodes to an existing cluster, the new nodes have the ambari-agent installed and started; by default, the existing nodes do not have the agent or the ambari-server re-installed. When verifying an existing cluster, by default, the ambari-agent and ambari-server are not re-installed. However, if --force-ambari-update is specified, the ambari-agents and ambari-server are always installed or re-installed.
- --hadoop-mgmt-node is the hostname of the Ambari management server. Default is localhost.
- --yarn-master is the hostname of the YARN resource manager server. Default is localhost.
Given below is an example of running the setup_cluster.sh script on the Ambari Management Server and four Red Hat Gluster Storage nodes that have the same logical volume and mount point intended to be used as a Red Hat Gluster Storage brick:
./setup_cluster.sh --yarn-master yarn.hdp rhs-1.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1 rhs-2.hdp rhs-3.hdp rhs-4.hdp
Note
If a brick mount is omitted, the brick mount of the first node is used and if one block device is omitted, the block device of the first node is used.
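The defaulting rule in the note above can be sketched in shell. This is a simplified illustration of the documented behavior, not the actual setup_cluster.sh logic, and the hostnames and paths are hypothetical:

```shell
#!/bin/sh
# Sketch of the node-list-spec defaulting rule: when a node omits its
# brick mount or block device, the first node's values are used.
first_brickmnt=""
first_blkdev=""

for spec in rhs-1.hdp:/mnt/brick1:/dev/rhs_vg1/rhs_lv1 rhs-2.hdp rhs-3.hdp; do
  node=$(echo "$spec" | cut -d: -f1)
  brickmnt=$(echo "$spec" | cut -s -d: -f2)   # empty if omitted
  blkdev=$(echo "$spec" | cut -s -d: -f3)     # empty if omitted
  # Remember the first node's values, then fall back to them when omitted.
  [ -z "$first_brickmnt" ] && first_brickmnt="$brickmnt" first_blkdev="$blkdev"
  echo "$node ${brickmnt:-$first_brickmnt} ${blkdev:-$first_blkdev}"
done
```

Running it prints the fully resolved triple for every node, with rhs-2.hdp and rhs-3.hdp inheriting /mnt/brick1 and /dev/rhs_vg1/rhs_lv1 from the first node.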
7.2.3. Creating Volumes for use with Hadoop
Note
If an existing Red Hat Gluster Storage volume is used with Hadoop, skip this section and continue with the instructions in the next section.
Whether you have a new or existing Red Hat Gluster Storage trusted storage pool, a volume intended for use with Hadoop must be created in a way that supports Hadoop workloads. The supported volume configuration for Hadoop is a Distributed Replicated volume with a replica count of 2 or 3. You must not name the Hadoop-enabled Red Hat Gluster Storage volume hadoop or mapredlocal.
Run the script given below to create new volumes that you intend to use with Hadoop. The script provides the necessary configuration parameters to the volume as well as updates the Hadoop Configuration to make the volume accessible to Hadoop.
- Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
- Run the hadoop cluster configuration script as given below:
create_vol.sh [-y] [--quiet | --verbose | --debug] VOLNAME [--replica count] <volMountPrefix> <node-list>
where:
- -y causes all prompts to be auto-answered "yes". The default is that the user must respond to each prompt.
- --quiet is the default and produces the least output from the script.
- --verbose outputs more information about the steps taken by the script.
- --debug is the greatest level of output and is the same as seen in the /var/log/rhs-hadoop-install.log log file.
Note
The /var/log/rhs-hadoop-install.log log file contains the --debug level of detailed information regardless of the verbose level chosen when running the script.
- --replica count is the replica count. You can specify the replica count as 2 or 3. By default, the replica count is 2. The number of bricks must be a multiple of the replica count. The order in which bricks are specified determines how bricks are mirrored with each other: the first n bricks are mirrored, where n is the replica count.
- <node-list> is: <node1>:<brickmnt> <node2>[:<brickmnt2>] <node3>[:<brickmnt3>] ... [<nodeN>[:<brickmntN>]]
- VOLNAME is the name of the new Red Hat Gluster Storage volume. By default, the performance.stat-prefetch=off, cluster.eager-lock=on, and performance.quick-read=off performance-related options are set on the volume. The new volume is mounted on all storage nodes, even nodes not directly spanned by the volume, and on the yarn-master node.
- volMountPrefix is the name of the gluster-fuse mount path without the volume name, for example, /mnt/glusterfs or /distributed.
- brickmnt is the name of the XFS mount for the block devices used by the above nodes, for example, /mnt/brick1 or /external/HadoopBrick. When a Red Hat Gluster Storage volume is created, its bricks have the volume name appended, so brickmnt is a prefix for the volume's bricks. For example, if a new volume is named HadoopVol, then its brick list would be <node>:/mnt/brick1/HadoopVol or <node>:/external/HadoopBrick/HadoopVol.
Note
The node-list for create_vol.sh is similar to the node-list-spec used by setup_cluster.sh, except that a block device is not specified in create_vol.sh.
Given below is an example of how to create a volume named HadoopVol using four Red Hat Gluster Storage servers, each with the same brick mount, and mount the volume on /mnt/glusterfs:
./create_vol.sh HadoopVol /mnt/glusterfs rhs-1.hdp:/mnt/brick1 rhs-2.hdp rhs-3.hdp rhs-4.hdp
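The brick-naming convention described above (brick path = brick mount plus the volume name) can be illustrated with a short sketch; the hostnames, volume name, and brick mount are the same hypothetical ones used in the examples:

```shell
#!/bin/sh
# Illustration: each brick path is <brickmnt>/<volname>, so one XFS brick
# mount can host bricks for several volumes side by side.
VOLNAME="HadoopVol"
BRICKMNT="/mnt/brick1"

# Print the brick list create_vol.sh would produce for four nodes.
for node in rhs-1.hdp rhs-2.hdp rhs-3.hdp rhs-4.hdp; do
  echo "$node:$BRICKMNT/$VOLNAME"
done
```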
7.2.4. Deploying and Configuring the HDP 2.1 Stack on Red Hat Gluster Storage using Ambari Manager
Prerequisite
Before deploying and configuring the HDP stack, perform the following steps:
- Open the terminal window of the server designated to be the Ambari Management Server and replace the HDP 2.1.GlusterFS repoinfo.xml file with the HDP 2.1 repoinfo.xml file:
cp /var/lib/ambari-server/resources/stacks/HDP/2.1/repos/repoinfo.xml /var/lib/ambari-server/resources/stacks/HDP/2.1.GlusterFS/repos/
You will be prompted to overwrite the /2.1.GlusterFS/repos/repoinfo.xml file; type yes to overwrite it.
- Restart the Ambari Server:
# ambari-server restart
Perform the following steps to deploy and configure the HDP stack on Red Hat Gluster Storage:
Important
This section describes how to deploy HDP on Red Hat Gluster Storage. Selecting HDFS as the storage selection in the HDP 2.1.GlusterFS stack is not supported. If you want to deploy HDFS, you must select the HDP 2.1 stack (not HDP 2.1.GlusterFS) and follow the instructions in the Hortonworks documentation.
Ensure that you select only the supported 2.1.GlusterFS stack. Other unsupported *GlusterFS stacks might be available for selection.
- Launch a web browser and enter http://hostname:8080 in the URL, replacing hostname with the hostname of your Ambari Management Server.
Note
If the Ambari Console fails to load in the browser, it is usually because iptables is still running. Stop iptables by opening a terminal window and running the service iptables stop command.
- Enter admin and admin for the username and password.
- Assign a name to your cluster, such as MyCluster.
- Select the HDP 2.1.GlusterFS stack (if not already selected by default) and click Next.
- On the Install Options screen:
  - For Target Hosts, add the YARN server and all the nodes in the trusted storage pool.
  - Select Provide your SSH Private Key to automatically register hosts and provide the Ambari Server private key that was used to set up passwordless SSH across the cluster.
  - Click the Register and Confirm button. It may take a while for this process to complete.
- For Confirm Hosts, it may take a while for all the hosts to be confirmed.
  - After this process is complete, you can ignore any warnings from the Host Check related to File and Folder Issues, Package Issues, and User Issues, as these are related to customizations that are required for Red Hat Gluster Storage.
  - Click Next and ignore the Confirmation Warning.
- For Choose Services, unselect HDFS and, as a minimum, select GlusterFS, Ganglia, YARN+MapReduce2, ZooKeeper, and Tez.
Note
  - The use of Storm and Falcon has not been extensively tested and is not yet supported.
  - Do not select the Nagios service, as it is not supported. For more information, see subsection 21.1. Deployment Scenarios of chapter 21. Administering the Hortonworks Data Platform on Red Hat Gluster Storage in the Red Hat Gluster Storage 3.0 Administration Guide.
  - Selecting HDFS as the storage selection in the HDP 2.1.GlusterFS stack is not supported. If you want to deploy HDFS, you must select the HDP 2.1 stack (not HDP 2.1.GlusterFS) and follow the instructions in the Hortonworks documentation.
- For Assign Masters, set all the services to your designated YARN Master Server.
  - For ZooKeeper, select your YARN Master Server and at least two additional servers within your cluster.
  - Click Next to proceed.
- For Assign Slaves and Clients, select all the nodes except the YARN Master Server as NodeManagers.
  - Select the Client checkbox for each selected node.
  - Click Next to proceed.
- On the Customize Services screen:
  - Click the YARN tab, scroll down to the yarn.nodemanager.log-dirs and yarn.nodemanager.local-dirs properties, and remove any entries that begin with /mnt/glusterfs/.
Important
New Red Hat Gluster Storage and Hadoop clusters use the naming convention of /mnt/glusterfs/volname as the mount point for Red Hat Gluster Storage volumes. If you have existing Red Hat Gluster Storage volumes that were created with different mount points, then remove the entries of those mount points.
  - Update the following property on the YARN tab, in the Application Timeline Server section:
    Key: yarn.timeline-service.leveldb-timeline-store.path
    Value: /tmp/hadoop/yarn/timeline
  - Review the other tabs that are highlighted in red. These require you to enter additional information, such as passwords for the respective services.
- On the Review screen, review your configuration and then click the Deploy button.
- On the Summary screen, click the Complete button and ignore any warnings and the statement. This is normal, as there is still some additional configuration required before the services can be started.
- Click Next to proceed to the Ambari Dashboard. Select the YARN service on the top left and click Stop-All. Do not click Start-All until you perform the steps in Section 7.5, “Verifying the Configuration”.
7.2.5. Enabling Existing Volumes for use with Hadoop
Important
This section is mandatory for every volume you intend to use with Hadoop. It is not sufficient to run the create_vol.sh script; you must also follow the steps listed in this section.
If you have a volume that you would like to analyze with Hadoop and the volume was created by the create_vol.sh script above, then it must be enabled to support Hadoop workloads. Execute the enable_vol.sh script described below to validate the volume's setup and to update Hadoop's core-site.xml configuration file, which makes the volume accessible to Hadoop.
If you have a volume that was not created by the create_vol.sh script, it is important to ensure that both the bricks and the volumes that you intend to use are properly mounted and configured. If they are not, the enable_vol.sh script will display and log volume configuration errors. Perform the following steps to mount and configure bricks and volumes with the required parameters on all storage servers:
- Bricks need to be XFS-formatted logical volumes mounted with the noatime and inode64 parameters. For example, if the logical volume path is /dev/rhs_vg1/rhs_lv1 and that path is mounted on /mnt/brick1, then the /etc/fstab entry for the mount point should look as follows:
/dev/rhs_vg1/rhs_lv1 /mnt/brick1 xfs noatime,inode64 0 0
- Volumes must be mounted with the _netdev setting. Assuming your volume name is HadoopVol, the server's FQDN is rhs-1.hdp, and your intended mount point for the volume is /mnt/glusterfs/HadoopVol, then the /etc/fstab entry for the mount point of the volume must be as follows:
rhs-1.hdp:/HadoopVol /mnt/glusterfs/HadoopVol glusterfs _netdev 0 0
- Volumes that are to be used with Hadoop also need specific volume-level parameters set on them. To set these, shell into a node within the appropriate volume's trusted storage pool and run the following commands (the examples assume the volume name is HadoopVol):
# gluster volume set HadoopVol performance.stat-prefetch off
# gluster volume set HadoopVol cluster.eager-lock on
# gluster volume set HadoopVol performance.quick-read off
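If you manage several volumes, the three settings above can be scripted. The sketch below is a dry run: it only prints the gluster commands for a hypothetical volume name; drop the echo and run it on a storage node to apply them for real:

```shell
#!/bin/sh
# Dry-run sketch: print the three Hadoop-required volume-set commands
# for a given volume. VOL is a hypothetical volume name.
VOL="HadoopVol"

set_opts() {
  # Each entry is "<option> <value>" as required for Hadoop workloads.
  for opt in "performance.stat-prefetch off" \
             "cluster.eager-lock on" \
             "performance.quick-read off"; do
    echo "gluster volume set $1 $opt"
  done
}

set_opts "$VOL"
```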
- Perform the following to create several Hadoop directories on the volume:
  - Open the terminal window of one of the Red Hat Gluster Storage nodes in the trusted storage pool and navigate to the /usr/share/rhs-hadoop-install directory.
  - Run bin/add_dirs.sh volume-mount-dir list-of-directories, where volume-mount-dir is the path name of the glusterfs-fuse mount of the volume you intend to enable for Hadoop (including the name of the volume) and list-of-directories is the list generated by running the bin/gen_dirs.sh -d script. For example:
# bin/add_dirs.sh /mnt/glusterfs/HadoopVol $(bin/gen_dirs.sh -d)
After completing these three steps, you are ready to run the enable_vol.sh script.
Red Hat Gluster Storage-Hadoop has the concept of a default volume, which is the volume used when input and/or output URIs are unqualified. Unqualified URIs are common in Hadoop jobs, so defining the default volume, which can be set by the enable_vol.sh script, is important. The default volume is the first volume appearing in the fs.glusterfs.volume property in the /etc/hadoop/conf/core-site.xml configuration file. The enable_vol.sh script supports the --make-default option which, if specified, causes the supplied volume to be prepended to the above property and thus become the default volume. The default behavior of enable_vol.sh is to not make the target volume the default volume, meaning the volume name is appended, rather than prepended, to the above property value.
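The prepend-versus-append behavior can be illustrated with a toy comma-separated list. The property name is the real one from core-site.xml; the volume names are hypothetical:

```shell
#!/bin/sh
# Illustrates how --make-default changes the fs.glusterfs.volume list in
# core-site.xml: prepending makes the new volume the default, while the
# default behavior appends it. Volume names are hypothetical.
volumes="OldVol1,OldVol2"
new="HadoopVol"

with_default="$new,$volumes"     # enable_vol.sh ... --make-default
without_default="$volumes,$new"  # enable_vol.sh ... (default behavior)

echo "with --make-default:    fs.glusterfs.volume=$with_default"
echo "without --make-default: fs.glusterfs.volume=$without_default"
```

In the first case Hadoop resolves unqualified URIs against HadoopVol; in the second, OldVol1 remains the default.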
The --user and --pass options are required for the enable_vol.sh script to log in to the Ambari instance of the cluster to reconfigure the Red Hat Gluster Storage volume related configuration.
Note
The supported volume configuration for Hadoop is a Distributed Replicated volume with a replica count of 2 or 3. Also, when you run the enable_vol.sh script for the first time, you must specify the --make-default option.
- Open the terminal window of the server designated to be the Ambari Management Server and navigate to the /usr/share/rhs-hadoop-install/ directory.
- Run the Hadoop Trusted Storage pool configuration script as given below:
# enable_vol.sh [-y] [--quiet | --verbose | --debug] [--make-default] [--hadoop-mgmt-node node] [--yarn-master yarn-node][--rhs-node storage-node] [--user ambari-admin-user] [--pass admin-password] VOLNAME
For example:
# enable_vol.sh --yarn-master yarn.hdp --rhs-node rhs-1.hdp HadoopVol --make-default
- VOLNAME is the name of the Red Hat Gluster Storage volume.
- --yarn-master is the hostname of the YARN resource manager server. Default is localhost.
- --rhs-node is the name of any of the existing Red Hat Gluster Storage nodes in the cluster. It is required unless this script is being run from a storage node. This value is necessary in order to run the gluster CLI.
- --user and --pass are required to update the Hadoop configuration file (core-site.xml) residing on each node spanned by the volume. You must update the core-site.xml file for a volume to be visible to Hadoop jobs. These options default to the Ambari defaults.
- --make-default indicates that VOLNAME is to be made the default volume by prepending it to the core-site.xml volumes list property. The default behavior is to not alter the default volume name in the core-site.xml file.
- --quiet is the default and produces the least output from the script.
- --verbose outputs more information about the steps taken by the script.
- --debug is the greatest level of output and is the same as seen in the /var/log/rhs-hadoop-install.log log file.
- -y causes all prompts to be auto-answered "yes". The default is that the user must respond to each prompt.
Note
The /var/log/rhs-hadoop-install.log log file contains the --debug level of detailed information regardless of the verbose level chosen when running the script.
Note
If the --yarn-master and/or --rhs-node options are omitted, the default of localhost (the node from which the script is being executed) is assumed. Example:
./enable_vol.sh --yarn-master yarn.hdp --rhs-node rhs-1.hdp HadoopVol --make-default
