Did you look at the task logs to see why those tasks failed? Since it's a
back-end error, the console output doesn't tell you much. The task logs should
have a stack trace that shows why each task failed, and you can go from there.
On Fri, Apr 12, 2013 at 8:18 AM, Mua Ban <[EMAIL PROTECTED]> wrote:
> I am very new to Pig/Hadoop; I just started writing my first Pig script a
> couple of days ago, and I ran into this problem.
> My cluster has 9 nodes. I have to join two data sets, big and small, each
> collected over 4 weeks. I first took the first week of each data set; let's
> call these subsets B1 and S1 for the big and small data sets. The full
> 4-week data sets are B4 and S4.
> I ran my script on my cluster to join B1 and S1 and everything was fine; I
> got my joined data. However, when I ran my script to join B4 and S4, the
> script failed. B4 is 39GB, S4 is 300MB. B4 is skewed: some ids appear more
> frequently than others. I tried both the 'using skewed' and 'using replicated'
> modes for the join operation (by appending them to the end of the join
> clause below), and both fail.
> Here is my script, which I think is very simple:
> big = load 'bigdir/' using PigStorage(',') as (id:chararray,
> small = load 'smalldir/' using PigStorage(',') as
> J = JOIN big by id LEFT OUTER, small by id;
> store J into 'outputdir' using PigStorage(',');
> On the web UI of the tracker, I see that the job has 40 reducers (I guess
> this is normal: the total data is about 40GB, and by default Pig and Hadoop
> allocate one reducer per 1GB). If I use 'parallel 80' in the join operation
> above, then I see 80 reducers, and the join operation still fails.
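On the reducer count: the guess above matches how Pig estimates reducers when no PARALLEL clause is given. Here is a minimal sketch of that arithmetic, assuming Pig's default settings of 1,000,000,000 input bytes per reducer (pig.exec.reducers.bytes.per.reducer) and a cap of 999 reducers (pig.exec.reducers.max). The function name is hypothetical, the byte count for the big input comes from your console output, and the small input is approximated as 300 * 1024 * 1024 bytes.

```python
import math

# Assumed Pig defaults: pig.exec.reducers.bytes.per.reducer and pig.exec.reducers.max.
BYTES_PER_REDUCER = 1_000_000_000
MAX_REDUCERS = 999


def estimate_reducers(total_input_bytes,
                      bytes_per_reducer=BYTES_PER_REDUCER,
                      max_reducers=MAX_REDUCERS):
    """Hypothetical helper: ceil(input / bytes_per_reducer), clamped to [1, max_reducers]."""
    wanted = math.ceil(total_input_bytes / bytes_per_reducer)
    return min(max_reducers, max(1, wanted))


big_bytes = 39_528_533_645        # from "Successfully read ... (39528533645 bytes)"
small_bytes = 300 * 1024 * 1024   # roughly 300MB; an approximation

print(estimate_reducers(big_bytes + small_bytes))  # prints 40
```

So 40 reducers is what you would expect, and 'parallel 80' simply overrides the estimate. Adding reducers does not split a single hot key across tasks, so the task logs are still the place to look.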
> I checked file mapred-default.xml and found this:
> If I set the value of parallel in the join operation, it should override this,
> On the tracker GUI, I see that across different runs, the number of completed
> reducers varies between 4 and 10 (out of 40 total reducers). The tracker GUI
> shows the reason for the failed reducers: "Task
> attempt_201304081613_0046_r_000006_0 failed to report status for 600
> seconds. Killing!"
> *Could you please help?*
> Thank you very much,
> Here is the error report from the console screen where I ran this script:
> job_201304081613_0032 616 0 230 12 32 0 0 0 big MAP_ONLY
> job_201304081613_0033 705 1 21 6 6 234 234 234 SAMPLER
> Failed Jobs:
> JobId Alias Feature Message Outputs
> job_201304081613_0034 small SKEWED_JOIN Message: Job failed!
> Error - # of failed Reduce Tasks exceeded allowed limit. FailedCount: 1.
> LastFailedTask: task_201304081613_0034_r_000012
> Successfully read 364285458 records (39528533645 bytes) from:
> Failed to read data from "hdfs://d0521b01:24990/user/abc/small/"
> Total records written : 0
> Total bytes written : 0
> Spillable Memory Manager spill count : 0
> Total bags proactively spilled: 0
> Total records proactively spilled: 0
> Job DAG:
> job_201304081613_0032 -> job_201304081613_0033,
> job_201304081613_0033 -> job_201304081613_0034,
> job_201304081613_0034 -> null,
> 2013-04-10 20:11:23,815 [main] WARN - Encountered Warning REDUCER_COUNT_LOW 1 time(s).
> 2013-04-10 20:11:23,815 [main] INFO - Some jobs have failed! Stop running all dependent jobs
> 2013-04-10 20:11:23,815 [main] ERROR org.apache.pig.tools.grunt.GruntParser