Re: A question about Hadoop 1 job user id used for group mapping, which could lead to performance degradation
It just seems like lazy code. Later on, you can see this:

{code}

        for(Token<?> token : UserGroupInformation.getCurrentUser().getTokens()) {
          childUGI.addToken(token);
        }

{code}

So the JobToken does eventually get added to the UGI that runs the task code.
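
To make the effect of that loop concrete, here is a minimal plain-Java sketch with no Hadoop dependency. Everything in it is an assumption for illustration: `Identity` is a hypothetical stand-in for a UGI, the tokens are plain strings, and the user names "tasktracker" and "alice" are made up. It only shows that tokens held by the current identity end up on the child identity.

```java
import java.util.ArrayList;
import java.util.List;

public class TokenCopySketch {
    /** Hypothetical stand-in for a UGI: a user name plus the tokens attached to it. */
    static final class Identity {
        final String name;
        final List<String> tokens = new ArrayList<>();

        Identity(String name) {
            this.name = name;
        }

        void addToken(String token) {
            tokens.add(token);
        }
    }

    /** Creates a child identity for realUser and copies every token from current onto it. */
    static Identity childFor(Identity current, String realUser) {
        Identity child = new Identity(realUser);
        for (String token : current.tokens) {
            child.addToken(token);
        }
        return child;
    }

    public static void main(String[] args) {
        // Assumed names: "tasktracker" for the current identity, "alice" for the job's user.
        Identity current = new Identity("tasktracker");
        current.addToken("jobToken");

        Identity child = childFor(current, "alice");
        System.out.println(child.name + " holds " + child.tokens);
    }
}
```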

>  WARN org.apache.hadoop.security.UserGroupInformation (IPC Server handler 63 on 9000): No groups available for user job_201401071758_0002

This seems to be a problem. When the task tries to reach the NameNode, it should do so as the user, not as the job ID. This is more than a logging issue; I'd be surprised if jobs pass. Do you have permissions enabled on HDFS?

Oh, or is this in non-secure mode (i.e. without Kerberos)?
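
As a rough illustration of why that warning shows up on every call, here is a small plain-Java model of a user-to-groups lookup. It is only a sketch under stated assumptions: `GroupLookupSketch`, `addUser`, and `groupsFor` are hypothetical names, the user "alice" and her groups are made up, and this is not how Hadoop's group mapping is implemented. The point is just that a job ID is not a known user, so the lookup comes back empty and warns each time it is asked.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class GroupLookupSketch {
    // Hypothetical user directory: user name -> groups.
    private final Map<String, List<String>> groupsByUser = new HashMap<>();

    // Counts how many lookups produced a warning.
    int warnings = 0;

    void addUser(String user, String... groups) {
        groupsByUser.put(user, Arrays.asList(groups));
    }

    /** Returns the groups for user; warns and returns an empty list for unknown users. */
    List<String> groupsFor(String user) {
        List<String> groups = groupsByUser.getOrDefault(user, Collections.emptyList());
        if (groups.isEmpty()) {
            warnings++;
            System.err.println("WARN No groups available for user " + user);
        }
        return groups;
    }

    public static void main(String[] args) {
        GroupLookupSketch lookup = new GroupLookupSketch();
        lookup.addUser("alice", "hadoop", "analysts"); // assumed real user

        System.out.println(lookup.groupsFor("alice"));
        // A job ID used as a user name is unknown, so every request warns.
        for (int i = 0; i < 3; i++) {
            lookup.groupsFor("job_201401071758_0002");
        }
        System.out.println("warnings: " + lookup.warnings);
    }
}
```

In this model each request made under the job ID costs one failed lookup and one log line, which is the pattern described in the quoted report.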

+Vinod
On Jan 7, 2014, at 5:14 PM, Jian Fang <[EMAIL PROTECTED]> wrote:

> Hi,
>
> I looked at Hadoop 1.X source code and found some logic that I could not understand.
>
> In the org.apache.hadoop.mapred.Child class, there were two UGIs defined as follows.
>
>     UserGroupInformation current = UserGroupInformation.getCurrentUser();
>     current.addToken(jt);
>
>     UserGroupInformation taskOwner
>      = UserGroupInformation.createRemoteUser(firstTaskid.getJobID().toString());
>     taskOwner.addToken(jt);
>
> But it is the taskOwner that is actually passed as a UGI to the task tracker and then to HDFS. The first one is not referenced anywhere.
>
>     final TaskUmbilicalProtocol umbilical =
>       taskOwner.doAs(new PrivilegedExceptionAction<TaskUmbilicalProtocol>() {
>         @Override
>         public TaskUmbilicalProtocol run() throws Exception {
>           return (TaskUmbilicalProtocol)RPC.getProxy(TaskUmbilicalProtocol.class,
>               TaskUmbilicalProtocol.versionID,
>               address,
>               defaultConf);
>         }
>     });
>
> What puzzled me is that the job id is actually passed in as the user name to the task tracker. On the NameNode side, when it tries to map the non-existent user name, i.e., the job id, to a group, it always returns an empty array. As a result, we always see annoying warning messages such as
>
>  WARN org.apache.hadoop.security.UserGroupInformation (IPC Server handler 63 on 9000): No groups available for user job_201401071758_0002
>
> Sometimes the warning messages were emitted so fast, hundreds or even thousands per second on a big cluster, that system performance degraded dramatically.
>
> Could someone please explain why this logic was designed this way? Is there any benefit to using a non-existent user for the group mapping? Or is this a bug?
>
> Thanks in advance,
>
> John