Hadoop, mail # user - Problems with timeout when a Hadoop job generates a large number of key-value pairs


Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
Steve Lewis 2012-01-20, 21:57
On Fri, Jan 20, 2012 at 12:18 PM, Michel Segel <[EMAIL PROTECTED]> wrote:

> Steve,
> If you want me to debug your code, I'll be glad to set up a billable
> contract... ;-)
>
> What I am willing to do is to help you to debug your code..
The code seems to work well for small input files and is basically a
standard sample.

> .
>
> Did you time how long it takes in the Mapper.map() method?
> The reason I asked this is to first confirm that you are failing within a
> map() method.
> It could be that you're just not updating your status...
>

The map method starts out running very fast - generateSubstrings, the
only interesting part, runs in milliseconds. The only other thing the mapper
does is context.write, which SHOULD update status.
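For what it's worth, each context.write does normally count as progress; when long stretches pass without a write, the usual workaround is an explicit keep-alive call. Below is a minimal self-contained sketch of that pattern - Reporter here is a stand-in for the Mapper Context (whose progress() method Hadoop exposes), and the batch size of 1000 is an arbitrary choice:

```java
// Sketch of the keep-alive pattern used to avoid the 600 s task timeout.
// Reporter is a stand-in for Hadoop's Mapper.Context, which exposes progress().
interface Reporter { void progress(); }

public class KeepAlive {
    static final int PROGRESS_EVERY = 1000; // report every N emitted records (tunable)

    // emit all substrings of `line`, pinging the reporter periodically
    static long emitSubstrings(String line, Reporter reporter) {
        long emitted = 0;
        for (int i = 0; i < line.length(); i++) {
            for (int j = i + 1; j <= line.length(); j++) {
                // in the real mapper this would be context.write(...)
                emitted++;
                if (emitted % PROGRESS_EVERY == 0) reporter.progress();
            }
        }
        return emitted;
    }

    public static void main(String[] args) {
        long[] pings = {0};
        long n = emitSubstrings("x".repeat(100), () -> pings[0]++);
        System.out.println(n + " records, " + pings[0] + " progress calls");
        // prints: 5050 records, 5 progress calls
    }
}
```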

>
> You said that you are writing many output records for a single input.
>
> So let's take a look at your code.
> Are all writes of the same length? Meaning that in each iteration of
> Mapper.map() you will always write. K number of rows?
>

Because in my sample the input strings are all the same length, every call
to the mapper will write the same number of records.
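Concretely, if every input line has length n, the records written per map() call - all non-empty contiguous substrings - number n(n+1)/2, which is the amplification factor at work here. A quick sketch (the class name and the 100-character example line are illustrative assumptions):

```java
public class SubstringAmplification {
    // records emitted per map() call: all non-empty contiguous substrings
    // of a line of length n, i.e. n * (n + 1) / 2
    static long substringCount(long n) {
        return n * (n + 1) / 2;
    }

    public static void main(String[] args) {
        // e.g. a 100-character line produces 5050 output records
        System.out.println(substringCount(100)); // prints 5050
    }
}
```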

>
> If so, ask yourself why some iterations are taking longer and longer?
>

I believe the issue may relate to local storage getting filled and Hadoop
taking a LOT of time to rebalance the output. Assuming the string length is
the same on each map, there is no reason for some iterations to be longer
than others.

>
> Note: I'm assuming that the time for each iteration is taking longer than
> the previous...
>
I assume so as well, since in my cluster the first 50% of mapping goes
pretty fast.

> Or am I missing something?
>
How do I get timing of map iterations?
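One low-tech way (an illustrative sketch, not a built-in Hadoop facility) is to wrap the map() body with System.nanoTime() and log any iteration that runs long:

```java
public class MapTiming {
    // time one simulated map() body and return elapsed milliseconds
    static double timeIteration(Runnable mapBody) {
        long start = System.nanoTime();
        mapBody.run();
        return (System.nanoTime() - start) / 1_000_000.0;
    }

    public static void main(String[] args) {
        StringBuilder sink = new StringBuilder();
        // stand-in for the substring-emitting map body
        double ms = timeIteration(() -> {
            String line = "abcdefghij";
            for (int i = 0; i < line.length(); i++)
                for (int j = i + 1; j <= line.length(); j++)
                    sink.append(line, i, j);
        });
        // in a real mapper one would report via context.setStatus(...) when ms is large
        System.out.println("iteration took " + ms + " ms");
    }
}
```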

> -Mike
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jan 20, 2012, at 11:16 AM, Steve Lewis <[EMAIL PROTECTED]> wrote:
>
> > We have been having problems with mappers timing out after 600 sec when
> the
> > mapper writes many more, say thousands of records for every
> > input record - even when the code in the mapper is small and fast. I have
> > no idea what could cause the system to be so slow and am reluctant to
> raise
> > the 600 sec limit without understanding why there should be a timeout
> when
> > all MY code is very fast.
> >
> > I am enclosing a small sample which illustrates the problem. It will
> > generate a 4GB text file on hdfs if the input file does not exist or is
> not
> > at least that size and this will take some time (hours in my
> configuration)
> > - then the code is essentially wordcount but instead of finding and
> > emitting words - the mapper emits all substrings of the input data - this
> > generates a much larger output data and number of output records than
> > wordcount generates.
> > Still, the amount of data emitted is no larger than other data sets I
> know
> > Hadoop can handle.
> >
> > All mappers on my 8 node cluster eventually timeout after 600 sec - even
> > though I see nothing in the code which is even a little slow and suspect
> > that any slow behavior is in the  called Hadoop code. This is similar to
> a
> > problem we have in bioinformatics where a  colleague saw timeouts on his
> 50
> > node cluster.
> >
> > I would appreciate any help from the group. Note - if you have a text
> file
> > at least 4 GB the program will take that as an input without trying to
> > create its own file.
> > /*
> >
> > ===========================================================================================
> > */
> > import org.apache.hadoop.conf.*;
> > import org.apache.hadoop.fs.*;
> > import org.apache.hadoop.io.*;
> > import org.apache.hadoop.mapreduce.*;
> > import org.apache.hadoop.mapreduce.lib.input.*;
> > import org.apache.hadoop.mapreduce.lib.output.*;
> > import org.apache.hadoop.util.*;
> >
> > import java.io.*;
> > import java.util.*;
> > /**
> > * org.systemsbiology.hadoop.SubstringGenerator
> >  *
> >  * This illustrates an issue we are having where a mapper generating a
> > much larger volume of

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com