26.4. Running Hadoop Jobs Across Multiple Red Hat Gluster Storage Volumes

If you are already running Hadoop jobs on one volume and want to enable Hadoop on additional existing Red Hat Gluster Storage volumes, follow the steps in the Enabling Existing Volumes for use with Hadoop section of the Deploying the Hortonworks Data Platform on Red Hat Gluster Storage chapter in the Red Hat Gluster Storage 3.1 Installation Guide. If you do not yet have an additional volume and want to add one, first complete the procedures in the Creating volumes for use with Hadoop section and then those in the Enabling Existing Volumes for use with Hadoop section. This configures the additional volume for use with Hadoop.
Specifying volume specific paths when running Hadoop Jobs

When you specify a path in a Hadoop job, the full URI of the path is required. For example, if you have a volume named VolumeOne and need to pass in a file called myinput.txt located in a directory named input, you would specify it as glusterfs://VolumeOne/input/myinput.txt. The same format applies to output paths. The example below reads data from a path on VolumeOne and writes to a path on VolumeTwo.

# bin/hadoop jar /opt/HadoopJobs.jar ProcessLogs glusterfs://VolumeOne/input/myinput.txt glusterfs://VolumeTwo/output/

Note

The very first Red Hat Gluster Storage volume that is configured for using with Hadoop is the Default Volume. This is usually the volume name you specified when you went through the Installation Guide. The Default Volume is the only volume that does not require a full URI to be specified and is allowed to use a relative path. Thus, assuming your default volume is called HadoopVol, both glusterfs://HadoopVol/input/myinput.txt and /input/myinput.txt are processed the same when providing input to a Hadoop Job or using the Hadoop CLI.