Chapter 5. Register the Required Components
OpenStack Data Processing requires a Hadoop image containing the necessary elements to launch and use Hadoop clusters. Specifically, Red Hat OpenStack Platform requires an image containing Red Hat Enterprise Linux with the necessary data processing plug-in.
Once you have a Hadoop image suitable for the jobs you wish to run, register it to the OpenStack Data Processing service. To do so:
- Upload the image to the Image service. For instructions on how to do so, see Upload an Image.
- After uploading the image, select Project > Data Processing > Image Registry in the dashboard.
- Click Register Image, and select the Hadoop image from the Image drop-down menu.
- Enter the user name that the OpenStack Data Processing service should use to apply settings and manage processes on each instance/node. The user name set for this purpose on the official Red Hat Enterprise Linux images (which you used in Chapter 4, Create Hadoop Image) is cloud-user.
By default, the OpenStack Data Processing service adds the necessary plug-in and version tags to the Plugin and Version drop-down menus. Verify that the tag selection is correct, then click Add plugin tags to add them. The OpenStack Data Processing service also allows you to use custom tags to differentiate or group registered images. Use the Add custom tag button to add a tag; custom tags appear in the box under the Description field.
To remove a custom tag, click the x beside its name.
- Click Done. The image should now appear in the Image Registry table.
5.1. Register Input and Output Data Sources
After registering an image, register your data input source and output destination. You can register both as objects in the Object Storage service; to do so, you must first upload both as objects. For instructions on how to do so, see Upload an Object.
You can also register data objects straight from another Hadoop-compatible distributed file system (for example, HDFS). For information on how to upload data to your chosen distributed file system, see its documentation.
- In the dashboard, select Project > Data Processing > Data Sources.
- Click Create Data Source. Enter a name for your data source in the Name field.
- Use the Description field to describe the data source (optional).
Select your data source’s type and URL. The procedure for doing so depends on your source’s location:
If your data is located in the Object Storage service, select Swift from the Data Source Type drop-down menu. Then:
- Provide the container and object name of your data source as swift://CONTAINER/OBJECT in the URL field.
- If your data source requires a login, supply the necessary credentials in the Source username and Source password fields.
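The swift:// URL described above is straightforward to assemble programmatically. The following is a minimal sketch, assuming a hypothetical helper name and sample container/object names that are not part of any service API; it simply percent-encodes the segments so names with spaces or special characters stay valid in the URL field:

```python
from urllib.parse import quote

def swift_url(container: str, obj: str) -> str:
    """Build a swift://CONTAINER/OBJECT URL for a Data Processing source.

    Each segment is percent-encoded; slashes inside the object name
    are preserved because Object Storage object names may contain them.
    """
    return "swift://{}/{}".format(quote(container, safe=""), quote(obj, safe="/"))

# Example: an input object stored as "input/data.csv" in container "jobs"
print(swift_url("jobs", "input/data.csv"))  # swift://jobs/input/data.csv
```

Encoding the segments is optional for plain ASCII names, but it keeps the URL field valid if a container or object name contains spaces.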
If your data is located in a Hadoop Distributed File System (HDFS), select the corresponding source from the Data Source Type drop-down menu. Then, enter the data source’s URL in the URL field as hdfs://HDFSHOST:PORT/OBJECTPATH, where:
- HDFSHOST is the host name of the HDFS host.
- PORT is the port on which the data source is accessible.
- OBJECTPATH is the available path to the data source on HDFSHOST.
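The hdfs:// URL format above can be sketched the same way. This is an illustrative helper, not part of the Data Processing service; the host name and port 8020 (a common HDFS NameNode port) are example values only:

```python
def hdfs_url(host: str, port: int, path: str) -> str:
    """Build an hdfs://HDFSHOST:PORT/OBJECTPATH URL.

    `path` is the object path on the HDFS host; a leading slash is
    added if missing so the resulting URL stays well-formed.
    """
    if not (0 < port < 65536):
        raise ValueError("port must be in the range 1-65535")
    if not path.startswith("/"):
        path = "/" + path
    return f"hdfs://{host}:{port}{path}"

print(hdfs_url("namenode.example.com", 8020, "user/hadoop/input"))
# hdfs://namenode.example.com:8020/user/hadoop/input
```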
If your data is located in an S3 object store, select the corresponding source from the Data Source Type drop-down menu. Then, enter the data source URL in the URL field in the format s3://bucket/path/to/object.
If you have not already configured the following parameters in the cluster configuration or job execution settings, you must configure them here:
- S3 access key
- S3 secret key
- S3 endpoint: the URL of the S3 service, specified without the protocol.
- Use SSL: a boolean value.
- Use bucket in path: a boolean value indicating whether virtual-hosted-style or path-style URLs are used.
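The S3 settings above can be collected together with the s3:// URL. The sketch below is illustrative only: the helper name, field names, and sample values (including the truncated placeholder keys) are assumptions, not the service's API, but it mirrors the rule that the endpoint is given without a protocol and that the last two settings are booleans:

```python
def s3_source(bucket: str, key: str, access_key: str, secret_key: str,
              endpoint: str, use_ssl: bool, bucket_in_path: bool) -> dict:
    """Collect an s3://bucket/path/to/object URL with its parameters.

    `use_ssl` and `bucket_in_path` mirror the boolean "Use SSL" and
    "Use bucket in path" settings; `endpoint` must omit the protocol.
    """
    if "://" in endpoint:
        raise ValueError("S3 endpoint must not include the protocol")
    return {
        "url": f"s3://{bucket}/{key}",
        "access_key": access_key,
        "secret_key": secret_key,
        "endpoint": endpoint,
        "use_ssl": use_ssl,
        "bucket_in_path": bucket_in_path,
    }

# Placeholder credentials for illustration only
src = s3_source("mybucket", "path/to/object", "ACCESS_KEY", "SECRET_KEY",
                "s3.example.com", True, False)
print(src["url"])  # s3://mybucket/path/to/object
```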
- Click Done. The data source should now be available in the Data Sources table.
Perform this procedure for each data input or output object required for your jobs.