-
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
bmdevelopment 2010-06-24, 19:29
Hello,

I've been getting the following error when trying to run a very simple MapReduce job. The map phase finishes without problems, but the error occurs as soon as the job enters the reduce phase.

10/06/24 18:41:00 INFO mapred.JobClient: Task Id : attempt_201006241812_0001_r_000000_0, Status : FAILED
Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.

I am running a 5 node cluster and I believe I have all my settings correct:

* ulimit -n 32768
* DNS/RDNS configured properly
* hdfs-site.xml : http://pastebin.com/xuZ17bPM
* mapred-site.xml : http://pastebin.com/JraVQZcW

The program is very simple - it just counts a unique string in a log file. See here: http://pastebin.com/5uRG3SFL

When I run it, the job fails and I get the following output: http://pastebin.com/AhW6StEb

However, it runs fine when I do *not* use substring() on the value (see the map function in the code above).

This runs fine and completes successfully:
String str = val.toString();

This causes the error and fails:
String str = val.toString().substring(0,10);

Please let me know if you need any further information. It would be greatly appreciated if anyone could shed some light on this problem.

Thanks
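
One general point about the failing line, independent of the shuffle error: in Java, substring(0,10) throws StringIndexOutOfBoundsException for any input line shorter than 10 characters. Nothing in this message suggests that happened here (the maps are reported as finishing), so the following is only a minimal sketch of a defensive variant. The class name KeyPrefix and the sample lines are hypothetical and are not taken from the pastebin code.

```java
// Hypothetical helper, not part of the job discussed in this thread.
final class KeyPrefix {
    private KeyPrefix() {}

    /** Returns at most maxLen leading characters of line; short lines are returned unchanged. */
    static String of(String line, int maxLen) {
        if (maxLen < 0) {
            throw new IllegalArgumentException("maxLen must be >= 0");
        }
        // Plain substring(0, maxLen) would throw when line.length() < maxLen.
        return line.length() <= maxLen ? line : line.substring(0, maxLen);
    }

    public static void main(String[] args) {
        System.out.println(KeyPrefix.of("2010-06-24,18:41:00,GET", 10)); // leading 10 characters
        System.out.println(KeyPrefix.of("short", 10));                   // shorter than 10, unchanged
    }
}
```

In a mapper this would stand in for val.toString().substring(0,10), for example KeyPrefix.of(val.toString(), 10).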
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Hemanth Yamijala 2010-06-25, 04:40
Hi,

> However, runs fine when I do *not* use substring() on the value (see
> map function in code above).

It is striking that changing the code to use a substring makes a difference. Assuming it is consistent and not a red herring, can you look at the counters for the two jobs using the JobTracker web UI - things like map records, bytes, etc. - and see if there is a noticeable difference? Also, are the two programs being run against the exact same input data?

Also, since the cluster is small, you could look at the tasktracker logs on the machines where the maps ran, to see if there are any failures at the time the reduce attempts start failing.

Thanks
Hemanth
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
bmdevelopment 2010-06-25, 14:56
Hello,

Thanks so much for the reply. See inline.

On Fri, Jun 25, 2010 at 12:40 AM, Hemanth Yamijala <[EMAIL PROTECTED]> wrote:
> It catches attention that changing the code to use a substring is
> causing a difference. Assuming it is consistent and not a red herring,

Yes, this has been consistent over the last week. I was running 0.20.1 first and then upgraded to 0.20.2, but the results have been exactly the same.

> can you look at the counters for the two jobs using the JobTracker web
> UI - things like map records, bytes etc and see if there is a
> noticeable difference ?

OK, so here is the first job, using write.set(value.toString()); with *no* errors:
http://pastebin.com/xvy0iGwL

And here is the second job, using write.set(value.toString().substring(0, 10)); which fails:
http://pastebin.com/uGw6yNqv

And here is yet another, where I used a longer and therefore unique string, write.set(value.toString().substring(0, 20)); This makes every line unique, similar to the first job. It still fails.
http://pastebin.com/GdQ1rp8i

> Also, are the two programs being run against
> the exact same input data ?

Yes, exactly the same input: a single csv file with 23K lines. Using a shorter string leads to more identical keys and therefore more combining/reducing, but going by the above it seems to fail whether the substring/key is entirely unique (23000 combine output records) or mostly the same (9 combine output records).

> Also, since the cluster size is small, you could also look at the
> tasktracker logs on the machines where the maps have run to see if
> there are any failures when the reduce attempts start failing.

Here is the TT log from the last failed job. I do not see anything besides the shuffle failure, but there may be something I am overlooking or simply do not understand.
http://pastebin.com/DKFTyGXg

Thanks again!
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
bmdevelopment 2010-07-05, 07:11
Hello,

I still have had no luck with this over the past week, and I even get the exact same problem on a completely different 5 node cluster. Is it worth opening a new issue in JIRA for this?

Thanks
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Hemanth Yamijala 2010-07-06, 08:34
Hi,

Sorry, I couldn't take a close look at the logs until now. Unfortunately, I could not see any huge difference between the success and failure cases. Can you please check that things like basic hostname to IP address mapping are in place (if you have static resolution of hostnames set up)? A web search gives this as the most likely cause other users have hit for this problem. Also, do the disks have enough free space? It would also be great if you could upload your Hadoop configuration.

I do think it is very likely that configuration is the actual problem, because it works in one case anyway.

Thanks
Hemanth

On Mon, Jul 5, 2010 at 12:41 PM, bmdevelopment <[EMAIL PROTECTED]> wrote:
> I still have had no luck with this over the past week.
> And even get the same exact problem on a completely different 5 node cluster.
> Is it worth opening an new issue in jira for this?
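
The hostname to IP address check suggested above can be run from each node with a few lines of standard Java. This is only a sketch under assumptions: the class name HostCheck is made up for illustration, and the host names to test would be whatever the cluster's nodes are actually called (none are given in this thread).

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

// Hypothetical diagnostic; run it on every node with the names of all other nodes.
final class HostCheck {
    private HostCheck() {}

    /** Returns "host -> ip -> reverse-name", or a message if the forward lookup fails. */
    static String describe(String host) {
        try {
            InetAddress forward = InetAddress.getByName(host);   // forward lookup
            String ip = forward.getHostAddress();
            // Reverse lookup from the raw address; falls back to the IP text if it fails.
            String reverse = InetAddress.getByAddress(forward.getAddress()).getCanonicalHostName();
            return host + " -> " + ip + " -> " + reverse;
        } catch (UnknownHostException e) {
            return host + " -> UNRESOLVED (" + e.getMessage() + ")";
        }
    }

    public static void main(String[] args) {
        for (String host : args) {
            System.out.println(describe(host));
        }
    }
}
```

If the forward and reverse names disagree between nodes, or a name is unresolved on some node, that would be consistent with the hostname-mapping cause mentioned above; if everything lines up, that cause becomes less likely.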
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
bmdevelopment 2010-07-07, 08:02
Hi,

No problem. Thanks so much for your time. Greatly appreciated.

I agree that it must be a configuration problem, so today I started from scratch and did a fresh install of 0.20.2 on the 5 node cluster.

I've now noticed that the error occurs when compression is enabled. I ran the basic wordcount example like so:
http://pastebin.com/wvDMZZT0
and got the Shuffle Error.

The TT logs show this error:
WARN org.apache.hadoop.mapred.ReduceTask: java.io.IOException: Invalid header checksum: 225702cc (expected 0x2325)
Full logs:
http://pastebin.com/fVGjcGsW

My mapred-site.xml:
http://pastebin.com/mQgMrKQw

If I remove the compression config settings, the wordcount works fine - no more Shuffle Error. So I imagine I have something wrong in my compression settings. I'll continue looking into this to see what else I can find out.

Thanks a million.
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
bmdevelopment 2010-07-08, 06:49
A little more on this.

I've narrowed the problem down to using Lzop compression (com.hadoop.compression.lzo.LzopCodec) for mapred.map.output.compression.codec.

<property>
  <name>mapred.map.output.compression.codec</name>
  <value>com.hadoop.compression.lzo.LzopCodec</value>
</property>

If I do the above, I get the Shuffle Error. If I use DefaultCodec for mapred.map.output.compression.codec, there is no problem.

Is this a known issue? Or is this a bug? It doesn't seem like it should be the expected behavior.

I would be glad to contribute any further info on this if necessary. Please let me know.

Thanks
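
When comparing nodes, or comparing a working and a failing setup, it can help to read the codec setting straight out of each mapred-site.xml. Below is a minimal sketch using only the JDK's XML parser. The class name SiteConfig is hypothetical, the sample XML simply repeats the property quoted above, and the helper returns the first matching property without modelling how Hadoop itself merges configuration files.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.SAXException;

// Hypothetical helper for inspecting a Hadoop-style site file; not a Hadoop API.
final class SiteConfig {
    private SiteConfig() {}

    /** Returns the value of the first <property> whose <name> matches, or null if absent. */
    static String property(String xml, String name) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new ByteArrayInputStream(xml.getBytes(StandardCharsets.UTF_8)));
            NodeList props = doc.getElementsByTagName("property");
            for (int i = 0; i < props.getLength(); i++) {
                Element prop = (Element) props.item(i);
                if (name.equals(childText(prop, "name"))) {
                    return childText(prop, "value");
                }
            }
            return null;
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalArgumentException("could not parse configuration XML", e);
        }
    }

    private static String childText(Element parent, String tag) {
        NodeList nodes = parent.getElementsByTagName(tag);
        return nodes.getLength() == 0 ? null : nodes.item(0).getTextContent().trim();
    }

    public static void main(String[] args) {
        String xml = "<configuration><property>"
                + "<name>mapred.map.output.compression.codec</name>"
                + "<value>com.hadoop.compression.lzo.LzopCodec</value>"
                + "</property></configuration>";
        System.out.println(property(xml, "mapred.map.output.compression.codec"));
    }
}
```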
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Ted Yu 2010-07-08, 17:38
Todd fixed a bug where LZO header or block header data may fall on a read boundary:
http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58

I am wondering if that is related to the issue you saw.

On Wed, Jul 7, 2010 at 11:49 PM, bmdevelopment <[EMAIL PROTECTED]> wrote:
> So, I've narrowed down the problem to using Lzop compression
> (com.hadoop.compression.lzo.LzopCodec)
> for mapred.map.output.compression.codec.
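
For readers unfamiliar with this class of bug: a single InputStream.read call may return fewer bytes than requested, so a fixed-size header can be split across two reads. The sketch below illustrates that general behaviour with plain JDK streams. It is not the code from the linked commit, and the class name ReadBoundary, the 9-byte "header", and the 4-byte chunk size are all made-up illustration values.

```java
import java.io.ByteArrayInputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;

// Illustration only; unrelated to the actual hadoop-lzo sources.
final class ReadBoundary {
    private ReadBoundary() {}

    /** A stream that hands out at most 'chunk' bytes per read call, like a slow socket. */
    static InputStream chunked(byte[] data, int chunk) {
        if (chunk < 1) {
            throw new IllegalArgumentException("chunk must be >= 1");
        }
        return new ByteArrayInputStream(data) {
            @Override
            public synchronized int read(byte[] b, int off, int len) {
                return super.read(b, off, Math.min(len, chunk));
            }
        };
    }

    /** Naive approach: one read call, which may stop at a chunk boundary. */
    static int bytesFromSingleRead(byte[] data, int chunk, int want) {
        try (InputStream in = chunked(data, chunk)) {
            return in.read(new byte[want], 0, want);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    /** Robust approach: keep reading until 'want' bytes have arrived or the stream ends. */
    static int bytesFromReadFully(byte[] data, int chunk, int want) {
        try (InputStream in = chunked(data, chunk)) {
            byte[] buf = new byte[want];
            int done = 0;
            while (done < want) {
                int n = in.read(buf, done, want - done);
                if (n < 0) {
                    throw new EOFException("stream ended after " + done + " of " + want + " bytes");
                }
                done += n;
            }
            return done;
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        byte[] header = new byte[9];
        System.out.println("single read: " + bytesFromSingleRead(header, 4, 9));
        System.out.println("read fully:  " + bytesFromReadFully(header, 4, 9));
    }
}
```

Code that assumes the single read filled the whole buffer would go on to parse a partly empty header, which is one way a checksum mismatch could arise; whether that is what happened in this thread is not established here.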
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.
Todd Lipcon 2010-07-08, 18:22
On Thu, Jul 8, 2010 at 10:38 AM, Ted Yu <[EMAIL PROTECTED]> wrote:
> Todd fixed a bug where LZO header or block header data may fall on read
> boundary:
> http://github.com/toddlipcon/hadoop-lzo/commit/f3bc3f8d003bb8e24f254b25bca2053f731cdd58
>
> I am wondering if that is related to the issue you saw.

I don't think this bug would show up in intermediate output compression, but it's certainly possible. There have been a number of bugs fixed in LZO over on GitHub - are you using the GitHub version or the one from Google Code, which is out of date? Either my repo or Kevin's on GitHub should be a good version (I think we called the newest 0.3.4).

-Todd

Todd Lipcon
Software Engineer, Cloudera
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.bmdevelopment 2010-07-09, 03:26
Thanks everyone.
Yes, using the Google Code version referenced on the wiki:
http://wiki.apache.org/hadoop/UsingLzoCompression

I will try the latest version and see if that fixes the problem.
http://github.com/kevinweil/hadoop-lzo

Thanks
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Ted Yu 2010-07-09, 04:54
I updated http://wiki.apache.org/hadoop/UsingLzoCompression to specifically
mention this potential issue so that other people can avoid this problem.
Feel free to add more to it.
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.bmdevelopment 2010-07-09, 09:07
Hi, I updated to the version here:
http://github.com/kevinweil/hadoop-lzo

However, when I use lzop for intermediate compression I
am still having trouble - the reduce phase now freezes at 99% and
eventually fails.
No immediate problem, because I can use the default codec.
But may be of concern to someone else.

Thanks
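
For anyone landing here with the same symptom, the workaround mentioned above (falling back to the default codec for intermediate output) might look roughly like this in mapred-site.xml. This is only a sketch: the codec property name comes from this thread, while the `mapred.compress.map.output` switch and the fully qualified `org.apache.hadoop.io.compress.DefaultCodec` class name are assumptions based on a stock 0.20.x install, so check them against your own mapred-default.xml.

```xml
<!-- Sketch only: compress map output with the built-in DefaultCodec instead of LzopCodec. -->
<property>
  <!-- Assumed switch for intermediate (map output) compression; not quoted in this thread. -->
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <!-- Property name as used in this thread; the package prefix on the value is an assumption. -->
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.DefaultCodec</value>
</property>
```

This only sidesteps the LZO problem for the shuffle; it does not explain why the reduce still hangs at 99% with the newer hadoop-lzo build.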
-
Re: Shuffle Error: Exceeded MAX_FAILED_UNIQUE_FETCHES; bailing-out.Ted Yu 2010-07-09, 13:18
Did you check the task tracker log and the log from your reducer to see if
anything was wrong? Please also capture jstack output so that we can help
you diagnose.