Hadoop >> mail # user >> Problems with timeout when a Hadoop job generates a large number of key-value pairs


Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
One thing I can say for sure is that generateSubStrings() is not slow:
every input line in my sample is 100 characters, and the timing should be
very similar from one run to the next.

This sample is a simplification of a more complex real problem where we see
timeouts when a map generates significantly more records than it reads -
thousands for each input record. I see nothing to suggest that my code is
slow, and I strongly suspect the problem is internal to Hadoop.

On Fri, Jan 20, 2012 at 1:47 PM, Vinod Kumar Vavilapalli <
[EMAIL PROTECTED]> wrote:

> Every so often, you should do a context.progress() so that the
> framework knows that this map is doing useful work. That will prevent
> the framework from killing it after 10 mins. The framework
> automatically does this every time you do a
> context.write()/context.setStatus(), but if the map is stuck for 10
> mins while processing some keys (maybe in generateSubStrings()), that
> may lead to a timeout.
>
> HTH,
> +Vinod
>
>
> On Fri, Jan 20, 2012 at 9:16 AM, Steve Lewis <[EMAIL PROTECTED]>
> wrote:
> > We have been having problems with mappers timing out after 600 sec when the
> > mapper writes many more records than it reads, say thousands of records for
> > every input record, even when the code in the mapper is small and fast. I
> > have no idea what could cause the system to be so slow and am reluctant to
> > raise the 600 sec limit without understanding why there should be a timeout
> > when all MY code is very fast.
> >
> > I am enclosing a small sample which illustrates the problem. It will
> > generate a 4 GB text file on HDFS if the input file does not exist or is
> > not at least that size, and this will take some time (hours in my
> > configuration). After that, the code is essentially wordcount, but instead
> > of finding and emitting words, the mapper emits all substrings of the input
> > data; this generates a much larger output volume and number of output
> > records than wordcount does. Still, the amount of data emitted is no larger
> > than other data sets I know Hadoop can handle.
> >
> > All mappers on my 8-node cluster eventually time out after 600 sec, even
> > though I see nothing in the code which is even a little slow, and I suspect
> > that any slow behavior is in the called Hadoop code. This is similar to a
> > problem we have in bioinformatics where a colleague saw timeouts on his
> > 50-node cluster.
> >
> > I would appreciate any help from the group. Note: if you have a text file
> > of at least 4 GB, the program will take that as input without trying to
> > create its own file.
> > /*
> >
> > ===========================================================================================
> > */
> > import org.apache.hadoop.conf.*;
> > import org.apache.hadoop.fs.*;
> > import org.apache.hadoop.io.*;
> > import org.apache.hadoop.mapreduce.*;
> > import org.apache.hadoop.mapreduce.lib.input.*;
> > import org.apache.hadoop.mapreduce.lib.output.*;
> > import org.apache.hadoop.util.*;
> >
> > import java.io.*;
> > import java.util.*;
> >  /**
> >  * org.systemsbiology.hadoop.SubstringGenerator
> >  *
> >  * This illustrates an issue we are having where a mapper generating a
> >  * much larger volume of data and number of records times out even though
> >  * the code is small, simple and fast.
> >  *
> >  * NOTE!!! as written the program will generate a 4GB file in HDFS with
> >  * good input data - this is done only if the file does not exist but may
> >  * take several hours. It will only be done once. After that the failure
> >  * is fairly fast.
> >  *
> >  * What this will do is count unique substrings of lines of length
> >  * between MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH by generating all
> >  * substrings and then using the word count algorithm.
> >  * What is interesting is that the number and volume of writes in the
> >  * map phase is MUCH larger than the number of reads and the volume of
> >  * read data.
> >  *
> >  * The example is artificial but similar to some real BioInformatics
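
The attached source breaks off above. As a rough, hypothetical sketch of a
generateSubStrings() along the lines the Javadoc describes, emitting every substring of
a line with length between MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH; the class
name, constant values, and method shape are assumptions, not the original code.

import java.util.ArrayList;
import java.util.List;

public class SubstringSketch {

    public static final int MIN_SUBSTRING_LENGTH = 5;   // illustrative values, not from the attachment
    public static final int MAX_SUBSTRING_LENGTH = 32;

    // Return every substring of the line whose length lies between
    // MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH, inclusive.
    public static List<String> generateSubStrings(String line) {
        List<String> out = new ArrayList<String>();
        int n = line.length();
        for (int start = 0; start < n; start++) {
            int maxEnd = Math.min(n, start + MAX_SUBSTRING_LENGTH);
            for (int end = start + MIN_SUBSTRING_LENGTH; end <= maxEnd; end++) {
                out.add(line.substring(start, end));
            }
        }
        return out;
    }

    // For a 100-character line this produces a couple of thousand substrings,
    // matching the per-record write amplification described in the thread.
    public static void main(String[] args) {
        String line = "the quick brown fox jumps over the lazy dog again and again and again";
        System.out.println(generateSubStrings(line).size());
    }
}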

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com