Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
On Fri, Jan 20, 2012 at 12:18 PM, Michel Segel <[EMAIL PROTECTED]> wrote:

> Steve,
> If you want me to debug your code, I'll be glad to set up a billable
> contract... ;-)
>
> What I am willing to do is help you debug your code...
The code seems to work well for small input files and is basically a
standard sample.

>
> Did you time how long it takes in the Mapper.map() method?
> The reason I asked this is to first confirm that you are failing within a
> map() method.
> It could be that you're just not updating your status...
>

The map() method starts out running very fast - generateSubstrings, the
only interesting part, runs in milliseconds. The only other thing the mapper
does is context.write(), which SHOULD update status.
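
For reference, here is a minimal sketch of the pattern as I understand it
(class and constant names are mine, not from the actual job): every
context.write() should count as progress, but the mapper can also heartbeat
explicitly:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: emit every substring of each input line, heartbeating
// explicitly every REPORT_INTERVAL writes.
public class SubstringMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int REPORT_INTERVAL = 10000; // hypothetical value
    private final Text outKey = new Text();
    private final IntWritable one = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        int written = 0;
        for (int start = 0; start < line.length(); start++) {
            for (int end = start + 1; end <= line.length(); end++) {
                outKey.set(line.substring(start, end));
                context.write(outKey, one); // should itself count as progress
                if (++written % REPORT_INTERVAL == 0) {
                    context.progress(); // explicit heartbeat to the tracker
                    context.setStatus("wrote " + written + " records");
                }
            }
        }
    }
}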

>
> You said that you are writing many output records for a single input.
>
> So let's take a look at your code.
> Are all writes of the same length? Meaning that in each iteration of
> Mapper.map() you will always write K rows?
>

Because in my sample the input strings are all the same length, every call
to the mapper will write the same number of records.
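
To make that concrete (illustrative numbers, not the exact ones from my
job): a line of n characters has n*(n+1)/2 substrings, so a 100-character
line always produces 100*101/2 = 5050 output records - the same count for
every line of that length.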

>
> If so, ask yourself why some iterations are taking longer and longer?
>

I believe the issue may relate to local storage getting filled and Hadoop
taking a LOT of time to rebalance the output. Assuming the string length is
the same on each map, there is no reason for some iterations to be longer
than others.
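
If that turns out to be the cause, the stopgap I have been avoiding is
raising the limit. For reference, a sketch of doing that from a driver,
assuming the 0.20-era property name (the value is in milliseconds):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutDriver {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // mapred.task.timeout is in milliseconds; 0 disables it entirely.
        conf.setLong("mapred.task.timeout", 20 * 60 * 1000L); // 20 min, not 10
        Job job = new Job(conf, "substring-generator"); // name is hypothetical
        // ... mapper, reducer, and paths would be set as in the sample below ...
    }
}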

>
> Note: I'm assuming that the time for each iteration is taking longer than
> the previous...
>
I assume so as well, since in my cluster the first 50% of mapping goes
pretty fast.

> Or am I missing something?
>
How do I get timing of map iterations?
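
Answering my own question with what I would try (all names here are mine):
wrap the body of map() with a timer and publish the totals through counters,
which show up on the job tracker page:

import java.io.IOException;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: time every map() call; doRealWork() is a hypothetical stand-in
// for the real substring emission.
public class TimedMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        long start = System.currentTimeMillis();
        doRealWork(value, context);
        long elapsed = System.currentTimeMillis() - start;
        context.getCounter("MapTiming", "totalMapMillis").increment(elapsed);
        context.getCounter("MapTiming", "mapCalls").increment(1);
        if (elapsed > 10 * 1000L) { // flag unusually slow iterations
            context.getCounter("MapTiming", "slowCalls").increment(1);
        }
    }

    private void doRealWork(Text value, Context context)
            throws IOException, InterruptedException {
        context.write(value, ONE); // placeholder for the real emission loop
    }
}

Average time per call is then totalMapMillis / mapCalls, readable from the
job page after the run.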

> -Mike
>
> Sent from a remote device. Please excuse any typos...
>
> Mike Segel
>
> On Jan 20, 2012, at 11:16 AM, Steve Lewis <[EMAIL PROTECTED]> wrote:
>
> > We have been having problems with mappers timing out after 600 sec when
> > the mapper writes many more, say thousands, of records for every input
> > record - even when the code in the mapper is small and fast. I have no
> > idea what could cause the system to be so slow and am reluctant to raise
> > the 600 sec limit without understanding why there should be a timeout
> > when all MY code is very fast.
> >
> > I am enclosing a small sample which illustrates the problem. It will
> > generate a 4GB text file on hdfs if the input file does not exist or is
> > not at least that size, and this will take some time (hours in my
> > configuration) - then the code is essentially wordcount, but instead of
> > finding and emitting words, the mapper emits all substrings of the input
> > data - this generates much larger output data and many more output
> > records than wordcount generates. Still, the amount of data emitted is
> > no larger than other data sets I know Hadoop can handle.
> >
> > All mappers on my 8 node cluster eventually time out after 600 sec -
> > even though I see nothing in the code which is even a little slow, and I
> > suspect that any slow behavior is in the called Hadoop code. This is
> > similar to a problem we have in bioinformatics, where a colleague saw
> > timeouts on his 50 node cluster.
> >
> > I would appreciate any help from the group. Note - if you have a text
> > file at least 4 GB, the program will take that as an input without
> > trying to create its own file.
> > /*
> > ===========================================================================================
> > */
> > import org.apache.hadoop.conf.*;
> > import org.apache.hadoop.fs.*;
> > import org.apache.hadoop.io.*;
> > import org.apache.hadoop.mapreduce.*;
> > import org.apache.hadoop.mapreduce.lib.input.*;
> > import org.apache.hadoop.mapreduce.lib.output.*;
> > import org.apache.hadoop.util.*;
> >
> > import java.io.*;
> > import java.util.*;
> > /**
> >  * org.systemsbiology.hadoop.SubstringGenerator
> >  *
> >  * This illustrates an issue we are having where a mapper generating a
> >  * much larger volume of

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com