MapReduce, mail # user - MRv2 jobs fail when run with more than one slave
Re: MRv2 jobs fail when run with more than one slave
Arun C Murthy 2012-07-17, 23:04
Trevor,

 It's hard for folks here to help you with CDH patchsets (it's their call on what they include). Can you please try with vanilla Apache hadoop-2.0.0-alpha, and I'll try helping out?

thanks,
Arun

On Jul 17, 2012, at 2:24 PM, Trevor wrote:

> Hi all,
>
> I recently upgraded from CDH4b2 (0.23.1) to CDH4 (2.0.0). Now for some strange reason, my MRv2 jobs (TeraGen, specifically) fail if I run with more than one slave. For every slave except the one running the Application Master, I get the following failed tasks and warnings repeatedly:
>
> 12/07/13 14:21:55 INFO mapreduce.Job: Running job: job_1342207265272_0001
> 12/07/13 14:22:17 INFO mapreduce.Job: Job job_1342207265272_0001 running in uber mode : false
> 12/07/13 14:22:17 INFO mapreduce.Job:  map 0% reduce 0%
> 12/07/13 14:22:46 INFO mapreduce.Job:  map 1% reduce 0%
> 12/07/13 14:22:52 INFO mapreduce.Job:  map 2% reduce 0%
> 12/07/13 14:22:55 INFO mapreduce.Job:  map 3% reduce 0%
> 12/07/13 14:22:58 INFO mapreduce.Job:  map 4% reduce 0%
> 12/07/13 14:23:04 INFO mapreduce.Job:  map 5% reduce 0%
> 12/07/13 14:23:07 INFO mapreduce.Job:  map 6% reduce 0%
> 12/07/13 14:23:07 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000004_0, Status : FAILED
> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stdout
> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000004_0&filter=stderr
> 12/07/13 14:23:08 INFO mapreduce.Job: Task Id : attempt_1342207265272_0001_m_000003_0, Status : FAILED
> 12/07/13 14:23:08 WARN mapreduce.Job: Error reading task output Server returned HTTP response code: 400 for URL: http://perfgb0n0:8080/tasklog?plaintext=true&attemptid=attempt_1342207265272_0001_m_000003_0&filter=stdout
> ...
> 12/07/13 14:25:12 INFO mapreduce.Job:  map 25% reduce 0%
> 12/07/13 14:25:12 INFO mapreduce.Job: Job job_1342207265272_0001 failed with state FAILED due to:
> ...
>                 Failed map tasks=19
>                 Launched map tasks=31
>
> The HTTP 400 error appears to be generated by the ShuffleHandler, which is configured to run on port 8080 of the slaves, and doesn't understand that URL. What I've been able to piece together so far is that /tasklog is handled by the TaskLogServlet, which is part of the TaskTracker. However, isn't this an MRv1 class that shouldn't even be running in my configuration? Also, the TaskTracker appears to run on port 50060, so I don't know where port 8080 is coming from.
>
> Though it could be a red herring, this warning seems to be related to the job failing, despite the fact that the job makes progress on the slave running the AM. The Node Manager logs on both AM and non-AM slaves appear fairly similar, and I don't see any errors in the non-AM logs.
>
> Another strange data point: These failures occur running the slaves on ARM systems. Running the slaves on x86 with the same configuration works. I'm using the same tarball on both, which means that the native-hadoop library isn't loaded on ARM. The master/client is the same x86 system in both scenarios. All nodes are running Ubuntu 12.04.
>
> Thanks for any guidance,
> Trevor
>
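
[Editor's note: for readers hitting the same 400s, the port overlap Trevor describes is visible in the NodeManager's aux-service wiring. The sketch below uses property names and the 8080 default shuffle port as they existed in the hadoop-2.0.0-alpha era; it is an illustration, not configuration taken from this thread, so verify the names against your distribution's defaults.]

<!-- yarn-site.xml on each slave: sketch of the MRv2 shuffle wiring.
     Property names assumed from hadoop-2.0.0-alpha-era defaults. -->
<property>
  <name>yarn.nodemanager.aux-services</name>
  <value>mapreduce.shuffle</value>
</property>
<property>
  <name>yarn.nodemanager.aux-services.mapreduce.shuffle.class</name>
  <value>org.apache.hadoop.mapred.ShuffleHandler</value>
</property>
<property>
  <!-- Defaulted to 8080 in this release, so the ShuffleHandler (not an
       MRv1 TaskTracker) is what answers the /tasklog requests above
       with HTTP 400, since it does not serve that URL. -->
  <name>mapreduce.shuffle.port</name>
  <value>8080</value>
</property>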

--
Arun C. Murthy
Hortonworks Inc.
http://hortonworks.com/