MapReduce >> mail # user >> Child Error


Hello, I have a 20-node Hadoop cluster where each node has 8 GB of memory
and an 8-core processor. I intermittently get the following error:
-----------------------------------------------------------------------------------------------------------

Exception in thread "main" java.io.IOException: Exception reading
file:/var/tmp/jim/hadoop-jim/mapred/local/taskTracker/jim/jobcache/job_201305231647_0005/jobToken
at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:135)
at org.apache.hadoop.mapreduce.security.TokenCache.loadTokens(TokenCache.java:165)
at org.apache.hadoop.mapred.Child.main(Child.java:92)
Caused by: java.io.IOException: failure to login
at org.apache.hadoop.security.UserGroupInformation.getLoginUser(UserGroupInformation.java:501)
at org.apache.hadoop.security.UserGroupInformation.getCurrentUser(UserGroupInformation.java:463)
at org.apache.hadoop.fs.FileSystem$Cache$Key.<init>(FileSystem.java:1519)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:1420)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:254)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:187)
at org.apache.hadoop.security.Credentials.readTokenStorageFile(Credentials.java:129)
... 2 more
Caused by: javax.security.auth.login.LoginException:
java.lang.NullPointerException: invalid null input: name
at com.sun.security.auth.UnixPrincipal.<init>(UnixPrincipal.java:70)
at com.sun.security.auth.module.UnixLoginModule.login(UnixLoginModule.java:132)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:57)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)

......

-----------------------------------------------------------------------------------------------------------
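
The innermost cause in this trace is UnixPrincipal rejecting a null user
name, which suggests that the child JVM could not map its UID to a user name
at login time. Below is a minimal sketch for checking that lookup on a node,
written in Python rather than Java for brevity. The function name
resolve_user_name is hypothetical, and whether a failing lookup is really the
cause here is an assumption that the trace alone does not prove.

```python
import os
import pwd


def resolve_user_name(uid):
    """Return the user name for uid, or None if the system has no entry for it."""
    try:
        return pwd.getpwuid(uid).pw_name
    except KeyError:
        # No passwd/NSS entry for this UID. A JVM in the same situation
        # would likely have no name to pass to UnixPrincipal (assumption).
        return None


if __name__ == "__main__":
    uid = os.getuid()
    name = resolve_user_name(uid)
    if name is None:
        print(f"UID {uid} does not resolve to a user name")
    else:
        print(f"UID {uid} resolves to {name!r}")
```

Running this as the task user on a task tracker node, ideally while a large
job is in progress, may show whether the lookup fails only under load.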

This does not always happen, but I see a pattern: when the intermediate data
is larger, it tends to occur more frequently. In the web log, I can see the
following:

java.lang.Throwable: Child Error
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:271)
Caused by: java.io.IOException: Task process exit with nonzero status of 1.
at org.apache.hadoop.mapred.TaskRunner.run(TaskRunner.java:258)

From what I read online, a possible cause is not having enough memory for
all the JVMs. My mapred-site.xml allocates 1100 MB to each child, and the
maximum numbers of map and reduce tasks are each set to 3, so that is
6600 MB for the child JVMs + (500 MB * 2) for the data node and task tracker
(as I set HADOOP_HEAP to 500 MB), or 7600 MB in total. I don't think memory
is the cause, but I haven't been able to avoid the error so far.
In case it helps, here are the relevant sections of my mapred-site.xml:

-----------------------------------------------------------------------------------------------------------

  <property>
    <name>mapred.tasktracker.map.tasks.maximum</name>
    <value>3</value>
  </property>

  <property>
    <name>mapred.tasktracker.reduce.tasks.maximum</name>
    <value>3</value>
  </property>

  <property>
    <name>mapred.child.java.opts</name>
    <value>-Xmx1100M -ea -XX:+HeapDumpOnOutOfMemoryError -XX:HeapDumpPath=/var/tmp/soner</value>
  </property>

  <property>
    <name>mapred.reduce.parallel.copies</name>
    <value>5</value>
  </property>

  <property>
    <name>tasktracker.http.threads</name>
    <value>80</value>
  </property>
-----------------------------------------------------------------------------------------------------------
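
The memory arithmetic described earlier in this post can be written out as a
quick check. This is a rough sketch using only the values quoted here; the
function and parameter names are hypothetical. It counts maximum heap sizes
only, and a JVM typically uses more than its -Xmx value (thread stacks,
native buffers and so on), so the real footprint is likely higher than what
this prints.

```python
def worst_case_heap_mb(map_slots, reduce_slots, child_heap_mb,
                       daemon_heap_mb, daemon_count=2):
    """Sum of the maximum heap sizes if every task slot is busy at once."""
    child_total = (map_slots + reduce_slots) * child_heap_mb
    daemon_total = daemon_count * daemon_heap_mb
    return child_total + daemon_total


if __name__ == "__main__":
    # Assumption: "8GB" is taken to mean 8192 MB.
    node_memory_mb = 8 * 1024
    total = worst_case_heap_mb(map_slots=3, reduce_slots=3,
                               child_heap_mb=1100, daemon_heap_mb=500)
    print(f"worst-case heap total: {total} MB")
    print(f"left for OS and JVM overhead: {node_memory_mb - total} MB")
```

With the settings above, the heap total is 7600 MB, which leaves little
headroom on an 8 GB node once non-heap JVM memory and the operating system
are taken into account.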

My jobs still complete most of the time, though they occasionally fail, and
I'm really puzzled at this point. I'd appreciate any help or ideas.

Thanks