7.6. Troubleshooting

This section describes the most common troubleshooting scenarios related to Hadoop and Red Hat Gluster Storage integration.
Deployment of HDP 2.1 on an LDAP-enabled cluster fails with "Execution of 'groupmod hadoop' returned 10. groupmod: group 'hadoop' does not exist in /etc/group"

This is caused by an Ambari bug: Ambari expects a local hadoop group, but on an LDAP-enabled cluster users and groups are managed centrally in LDAP, so Ambari cannot find the group in /etc/group. To resolve this issue:

  1. Open a shell on the Ambari Server and change to the /var/lib/ambari-server/resources/scripts directory.
  2. Run the following command, replacing $AMBARI-SERVER-FQDN with the FQDN of your Ambari Server and $AMBARI-CLUSTER-NAME with the name you specified for your cluster in Ambari:
    ./configs.sh set $AMBARI-SERVER-FQDN $AMBARI-CLUSTER-NAME global ignore_groupsusers_create "true"
  3. In the Ambari console, click Retry in the Cluster Installation Wizard.
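To confirm that a cluster is in this situation, you can check whether the hadoop group resolves through the name service (for example, from LDAP) while being absent from the local /etc/group file. The following is a minimal diagnostic sketch, not part of Ambari: group_source is a hypothetical helper name, and the optional second argument (defaulting to /etc/group) exists only to make the check easy to try against another file.

```shell
#!/bin/sh
# Hypothetical helper: report where a group name resolves from.
#   local   - the group has an entry in the local group file (default /etc/group)
#   nss     - not in the local file, but resolvable through NSS (for example LDAP)
#   missing - not resolvable at all
group_source() {
    group="$1"
    group_file="${2:-/etc/group}"
    if grep -q -- "^${group}:" "$group_file"; then
        echo "local"
    elif getent group "$group" >/dev/null 2>&1; then
        echo "nss"
    else
        echo "missing"
    fi
}
```

On an affected Ambari Server, group_source hadoop would be expected to print nss, which matches the groupmod error above: the group exists centrally but not in /etc/group.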

The WebHCAT service does not start

This is caused by a permissions bug in WebHCAT. The service must be started several times, and the permissions of the files it copies must be changed between attempts. On each start attempt, WebHCAT copies a different file with root permissions into /mnt/glusterfs/HadoopVolumeName/apps/webhcat. The three files it copies to this directory are hadoop-streaming-2.4.0.2.1.5.0-648.jar, HDP-webhcat/hive.tar.gz and HDP-webhcat/pig.tar.gz. To resolve this issue:

  1. Start the WebHCAT service.
  2. After the start attempt, run chmod 755 on the file that was just copied into /mnt/glusterfs/HadoopVolumeName/apps/webhcat.
  3. Repeat steps 1 and 2 until the permissions are set on all three files. The service then starts and is operational on the fourth attempt.
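The chmod step that follows each start attempt can be scripted. This is a minimal sketch rather than a supported tool: fix_webhcat_perms is a hypothetical helper name, its argument stands for your /mnt/glusterfs/HadoopVolumeName/apps/webhcat path, and the file names are the three listed in this section, so adjust them if your HDP build differs. Run it as root, because the files are created with root permissions.

```shell
#!/bin/sh
# Hypothetical helper: set mode 755 on each file WebHCAT has copied into its
# apps directory so far. Files that have not been copied yet are skipped, so
# the function can be re-run after every start attempt.
fix_webhcat_perms() {
    dir="$1"
    # File names as listed in this section (assumed; adjust for your HDP build).
    for f in \
        hadoop-streaming-2.4.0.2.1.5.0-648.jar \
        HDP-webhcat/hive.tar.gz \
        HDP-webhcat/pig.tar.gz
    do
        if [ -e "$dir/$f" ]; then
            chmod 755 "$dir/$f" && echo "set 755 on $dir/$f"
        else
            echo "not copied yet: $dir/$f"
        fi
    done
}
```

For example, run fix_webhcat_perms /mnt/glusterfs/HadoopVolumeName/apps/webhcat after each start attempt. Setting 755 again on a file that was already fixed is harmless, which is why the sketch does not track which attempt it is on.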

Exception stating "job.jar changed on src file system" or "job.xml changed on src file system"

This error occurs when the clocks of the servers in the trusted storage pool are not synchronized. All servers in the trusted storage pool must keep the same time. It is recommended that you set up an NTP (Network Time Protocol) service to keep the time on all bricks synchronized and avoid problems caused by clock drift.
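
A rough way to see whether the clocks have drifted is to compare the epoch time reported by each server. The sketch below is illustrative only: clock_skew is a hypothetical helper, server1 to server3 are placeholder host names, and passwordless ssh from the machine running it is assumed. The result is approximate, because date +%s has one-second resolution and each ssh call adds latency; it can reveal gross drift but does not replace NTP.

```shell
#!/bin/sh
# Hypothetical helper: read "host epoch-seconds" pairs on stdin and print the
# spread (largest minus smallest timestamp) in seconds.
clock_skew() {
    awk 'NR == 1 { min = $2; max = $2 }
         { if ($2 < min) min = $2; if ($2 > max) max = $2 }
         END { if (NR > 0) print max - min }'
}

# Example collection loop (placeholder host names; assumes passwordless ssh):
# for h in server1 server2 server3; do
#     echo "$h $(ssh "$h" date +%s)"
# done | clock_skew
```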

For more information on configuring NTP, see https://access.redhat.com/documentation/en-US/Red_Hat_Enterprise_Linux/6/html/Migration_Planning_Guide/sect-Migration_Guide-Networking-NTP.html

FileNotFoundException with a "jobtoken does not exist" message while running a Hadoop job

This error occurs when user IDs (UIDs) and group IDs (GIDs) are not consistent across the trusted storage pool. For example, user tom has a UID of 1002 on server1 but a UID of 1003 on server2. The simplest and recommended approach is to use LDAP authentication to resolve this issue. After creating the necessary users and groups on an LDAP server, configure the servers in the trusted storage pool to use the LDAP server for authentication. For more information on configuring authentication, see Chapter 12, Configuring Authentication, in the Red Hat Enterprise Linux 6 Deployment Guide.
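
To find which accounts are affected, you can compare the user databases of the servers. This is a minimal sketch under stated assumptions: inconsistent_ids is a hypothetical helper, server1 and server2 are placeholder host names, and passwordless ssh is assumed. It reads "host name id" lines and prints every name that maps to more than one ID. Feeding it getent group output instead of getent passwd checks GIDs the same way, because the GID is also the third colon-separated field in that output.

```shell
#!/bin/sh
# Hypothetical helper: read "host name id" triples on stdin and print every
# name that has more than one distinct ID across hosts.
inconsistent_ids() {
    awk '{ key = $2 SUBSEP $3
           if (!(key in seen)) { seen[key] = 1; count[$2]++ } }
         END { for (n in count) if (count[n] > 1) print n }'
}

# Example collection loop for UIDs (placeholder host names):
# for h in server1 server2; do
#     ssh "$h" getent passwd | awk -F: -v h="$h" '{ print h, $1, $3 }'
# done | inconsistent_ids
```

With the example from this section (tom with UID 1002 on server1 and 1003 on server2), the helper prints tom.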