MapReduce user mailing list - Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.


Thread:
  bmdevelopment       2010-06-24, 19:29
  Hemanth Yamijala    2010-06-25, 04:40
  bmdevelopment       2010-06-25, 14:56
  bmdevelopment       2010-07-05, 07:11

Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Hi,

Sorry, I couldn't take a close look at the logs until now.
Unfortunately, I could not see any significant difference between the
success and failure cases. Can you please check that basic hostname-to-IP
mapping is in place (if you have static resolution of hostnames set up)?
A web search suggests this is the most common cause users have hit for
this problem. Also, do the disks have enough free space? It would also
be great if you could upload your Hadoop configuration.

I do think it is very likely that the configuration is the actual
problem, since the code itself works in one case anyway.

Thanks
Hemanth
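
One way to sanity-check the hostname-to-IP mapping mentioned above is a
small Java probe along these lines; this is only a sketch, and the node
names are placeholders rather than anything from this thread:

    import java.net.InetAddress;
    import java.net.UnknownHostException;

    // Sketch: verify forward and reverse DNS agree for each cluster node.
    // The hostnames are placeholders; substitute the cluster's real nodes.
    public class CheckResolution {
        public static void main(String[] args) throws UnknownHostException {
            String[] nodes = {"node1", "node2", "node3", "node4", "node5"};
            for (String host : nodes) {
                InetAddress addr = InetAddress.getByName(host);  // forward lookup
                String reverse = addr.getCanonicalHostName();    // reverse lookup
                System.out.printf("%s -> %s -> %s%n",
                        host, addr.getHostAddress(), reverse);
                if (!reverse.equals(host) && !reverse.startsWith(host + ".")) {
                    System.out.println("  WARNING: forward/reverse lookups disagree");
                }
            }
        }
    }

Since resolution can differ per host, it is worth running this on every
node and checking that all five machines print consistent lines.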

On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[EMAIL PROTECTED]> wrote:
> Hello,
> I've still had no luck with this over the past week.
> And I even get the exact same problem on a completely different 5-node cluster.
> Is it worth opening a new issue in JIRA for this?
> Thanks
>
>
> On Fri, Jun 25, 2010 at 11:56 PM, bmdevelopment <[EMAIL PROTECTED]> wrote:
>> Hello,
>> Thanks so much for the reply.
>> See inline.
>>
>> On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[EMAIL PROTECTED]> wrote:
>>> Hi,
>>>
>>>> I've been getting the following error when trying to run a very simple
>>>> MapReduce job.
>>>> The map phase finishes without problems, but the error occurs as soon
>>>> as the job enters the reduce phase.
>>>>
>>>> 10/06/24 18:41:00 INFO mapred.JobClient: Task Id :
>>>> attempt_201006241812_0001_r_000000_0, Status : FAILED
>>>> Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
>>>>
>>>> I am running a 5-node cluster and I believe I have all my settings correct:
>>>>
>>>> * ulimit -n 32768
>>>> * DNS/RDNS configured properly
>>>> * hdfs-site.xml : http://pastebin.com/xuZ17bPM
>>>> * mapred-site.xml : http://pastebin.com/JraVQZcW
>>>>
>>>> The program is very simple - just counts a unique string in a log file.
>>>> See here: http://pastebin.com/5uRG3SFL
>>>>
>>>> When I run, the job fails and I get the following output.
>>>> http://pastebin.com/AhW6StEb
>>>>
>>>> However, it runs fine when I do *not* use substring() on the value (see
>>>> the map function in the code above).
>>>>
>>>> This runs fine and completes successfully:
>>>>            String str = val.toString();
>>>>
>>>> This causes error and fails:
>>>>            String str = val.toString().substring(0,10);
>>>>
>>>> Please let me know if you need any further information.
>>>> It would be greatly appreciated if anyone could shed some light on this problem.
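
One thing worth noting about the failing line: substring(0, 10) throws
StringIndexOutOfBoundsException on any input line shorter than 10
characters. Whether or not that is what is happening here, a
length-guarded variant avoids it. A minimal sketch (the class name and
prefix length are assumptions, not the actual code behind the pastebin
link above):

    import java.io.IOException;

    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;

    // Sketch of a length-guarded mapper; not the poster's actual code.
    public class PrefixCountMapper
            extends Mapper<LongWritable, Text, Text, IntWritable> {

        private static final int PREFIX_LEN = 10;
        private static final IntWritable ONE = new IntWritable(1);
        private final Text outKey = new Text();

        @Override
        protected void map(LongWritable key, Text value, Context context)
                throws IOException, InterruptedException {
            String line = value.toString();
            // Guard the substring: lines shorter than PREFIX_LEN would
            // otherwise throw StringIndexOutOfBoundsException.
            String prefix = line.length() >= PREFIX_LEN
                    ? line.substring(0, PREFIX_LEN)
                    : line;
            outKey.set(prefix);
            context.write(outKey, ONE);
        }
    }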
>>>
>>> It is striking that changing the code to use a substring makes a
>>> difference. Assuming it is consistent and not a red herring,
>>
>> Yes, this has been consistent over the last week. I was running 0.20.1
>> first and then upgraded to 0.20.2, but the results have been exactly
>> the same.
>>
>>> can you look at the counters for the two jobs using the JobTracker web
>>> UI - things like map records, bytes etc and see if there is a
>>> noticeable difference ?
>>
>> Ok, so here is the first job, which uses write.set(value.toString());
>> and has *no* errors:
>> http://pastebin.com/xvy0iGwL
>>
>> And here is the second job using
>> write.set(value.toString().substring(0, 10)); that fails:
>> http://pastebin.com/uGw6yNqv
>>
>> And here is yet another, where I used a longer, and therefore unique,
>> string via write.set(value.toString().substring(0, 20)); This makes
>> every line unique, similar to the first job.
>> It still fails.
>> http://pastebin.com/GdQ1rp8i
>>
>>> Also, are the two programs being run against
>>> the exact same input data?
>>
>> Yes, exactly the same input: a single CSV file with 23K lines.
>> Using a shorter substring produces more duplicate keys and therefore
>> more combining/reducing, but going by the above it seems to fail whether
>> the substring/key is entirely unique (23000 combine output records) or
>> mostly the same (9 combine output records).
>>
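
A toy illustration of the cardinality point above, as a sketch; the
sample lines are invented and only assume a CSV whose first ~10
characters (e.g. a date) repeat across lines:

    import java.util.HashSet;
    import java.util.Set;

    // Toy demo: a short prefix collapses many lines onto one key, while a
    // longer prefix (date + time) keeps every line's key distinct.
    public class KeyCardinality {
        public static void main(String[] args) {
            String[] lines = {
                "2010-06-24,18:41:00,hostA,GET /index",
                "2010-06-24,18:41:01,hostB,GET /login",
                "2010-06-24,18:41:02,hostC,GET /index",
            };
            Set<String> shortKeys = new HashSet<String>();
            Set<String> longKeys = new HashSet<String>();
            for (String l : lines) {
                shortKeys.add(l.substring(0, 10)); // shared date prefix
                longKeys.add(l.substring(0, 20));  // date + time: distinct
            }
            // Prints "1 vs 3", mirroring the 9 vs 23000 combine output
            // records reported above.
            System.out.println(shortKeys.size() + " vs " + longKeys.size());
        }
    }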
>>>
>>> Also, since the cluster size is small, you could also look at the
>>> tasktracker logs on the machines where the maps have run to see if
Later replies:
  bmdevelopment       2010-07-07, 08:02
  bmdevelopment       2010-07-08, 06:49
  Ted Yu              2010-07-08, 17:38
  Todd Lipcon         2010-07-08, 18:22
  bmdevelopment       2010-07-09, 03:26
  Ted Yu              2010-07-09, 04:54
  bmdevelopment       2010-07-09, 09:07
  Ted Yu              2010-07-09, 13:18