Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
One thing I can say for sure is that generateSubStrings() is not slow -
every input line in my sample is 100 characters, so the timing should be
very similar from one run to the next.

This sample is a simplification of a more complex real problem where we
see timeouts when a map generates significantly more records than it
reads - thousands for each input record. I see nothing to suggest that my
code is slow, and I strongly suspect the problem is internal to Hadoop.
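For scale: a 100-character line has 101 - L substrings of each length L,
so if, say, MIN_SUBSTRING_LENGTH = 5 and MAX_SUBSTRING_LENGTH = 100
(illustrative bounds), the mapper writes 96 + 95 + ... + 1 = 4656 records
for every record it reads.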

On Fri, Jan 20, 2012 at 1:47 PM, Vinod Kumar Vavilapalli <
[EMAIL PROTECTED]> wrote:

> Every so often, you should call context.progress() so that the
> framework knows that this map is doing useful work. That will prevent
> the framework from killing it after 10 mins. The framework
> automatically does this every time you call
> context.write()/context.setStatus(), but if the map is stuck for 10
> mins while processing some keys (maybe in generateSubStrings()), that
> may lead to a timeout.
>
> HTH,
> +Vinod
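
A minimal sketch of the pattern Vinod describes, assuming the new
org.apache.hadoop.mapreduce API; the class name, the 1000-record
reporting interval, and the generate() helper are illustrative, not the
code from the sample below:

import java.io.IOException;
import java.util.Collections;
import java.util.List;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class ProgressReportingMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        int emitted = 0;
        for (String sub : generate(value.toString())) {
            outKey.set(sub);
            context.write(outKey, ONE);  // each write() also reports progress
            if (++emitted % 1000 == 0) {
                context.progress();      // explicit liveness ping
            }
        }
    }

    // Hypothetical stand-in for the slow per-line computation; if this
    // ran for minutes with no intervening writes, a context.progress()
    // call inside it would be the only thing keeping the task alive.
    private List<String> generate(String line) {
        return Collections.singletonList(line);
    }
}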
>
>
> On Fri, Jan 20, 2012 at 9:16 AM, Steve Lewis <[EMAIL PROTECTED]>
> wrote:
> > We have been having problems with mappers timing out after 600 sec when
> > the mapper writes many more records - say thousands - for every input
> > record, even when the code in the mapper is small and fast. I have no
> > idea what could cause the system to be so slow, and I am reluctant to
> > raise the 600 sec limit without understanding why there should be a
> > timeout when all MY code is very fast.
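
The 600 sec limit is the mapred.task.timeout property (in milliseconds,
600000 by default in the Hadoop releases of this era). A sketch of
raising it per job, if one did want to experiment - the job name here is
made up:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutConfig {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // value is in milliseconds; 600000 (10 min) is the default,
        // and 0 disables the timeout entirely
        conf.setLong("mapred.task.timeout", 1200000L);
        Job job = new Job(conf, "substring-count");
        // ... set mapper, reducer and paths as usual, then
        // job.waitForCompletion(true);
    }
}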
> >
> > I am enclosing a small sample which illustrates the problem. It will
> > generate a 4 GB text file on HDFS if the input file does not exist or is
> > not at least that size, and this will take some time (hours in my
> > configuration). Then the code is essentially word count, but instead of
> > finding and emitting words, the mapper emits all substrings of the input
> > data - this generates far more output data and output records than
> > word count does. Still, the amount of data emitted is no larger than
> > other data sets I know Hadoop can handle.
> >
> > All mappers on my 8-node cluster eventually time out after 600 sec, even
> > though I see nothing in the code which is even a little slow, and I
> > suspect that any slow behavior is in the Hadoop code being called. This
> > is similar to a problem we have in bioinformatics, where a colleague
> > saw timeouts on his 50-node cluster.
> >
> > I would appreciate any help from the group. Note - if you already have a
> > text file of at least 4 GB, the program will take that as input without
> > trying to create its own file.
> > /*
> >
> > ===========================================================================================
> > */
> > import org.apache.hadoop.conf.*;
> > import org.apache.hadoop.fs.*;
> > import org.apache.hadoop.io.*;
> > import org.apache.hadoop.mapreduce.*;
> > import org.apache.hadoop.mapreduce.lib.input.*;
> > import org.apache.hadoop.mapreduce.lib.output.*;
> > import org.apache.hadoop.util.*;
> >
> > import java.io.*;
> > import java.util.*;
> >  /**
> >  * org.systemsbiology.hadoop.SubstringGenerator
> >  *
> >  * This illustrates an issue we are having where a mapper generating a
> >  * much larger volume of data and number of records times out even
> >  * though the code is small, simple and fast
> >  *
> >  * NOTE!!! as written the program will generate a 4 GB file in HDFS with
> >  * good input data - this is done only if the file does not exist, but
> >  * it may take several hours. It will only be done once. After that the
> >  * failure is fairly fast
> >  *
> >  * What this will do is count unique substrings of lines of length
> >  * between MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH by generating
> >  * all substrings and then using the word count algorithm
> >  * What is interesting is that the number and volume of writes in the
> >  * map phase is MUCH larger than the number of reads and the volume of
> >  * read data
> >  *
> >  * The example is artificial but similar to some real BioInformatics

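The quoted sample is cut off above; a minimal sketch of the substring
generation the comment describes, with illustrative bounds (the real
MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH values are set in the full
sample):

import java.util.ArrayList;
import java.util.List;

public class SubstringSketch {
    public static final int MIN_SUBSTRING_LENGTH = 5;   // illustrative value
    public static final int MAX_SUBSTRING_LENGTH = 100; // illustrative value

    // Return every substring of the line whose length lies between
    // MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH, inclusive. Each
    // string is then written with a count of 1 and summed in the
    // reducer, exactly as in word count.
    public static List<String> generateSubStrings(String line) {
        List<String> out = new ArrayList<String>();
        int maxLen = Math.min(MAX_SUBSTRING_LENGTH, line.length());
        for (int len = MIN_SUBSTRING_LENGTH; len <= maxLen; len++) {
            for (int start = 0; start + len <= line.length(); start++) {
                out.add(line.substring(start, start + len));
            }
        }
        return out;
    }
}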
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com