Hadoop, mail # user - Problems with timeout when a Hadoop job generates a large number of key-value pairs


Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
Steve Lewis 2012-01-20, 21:49
Well - I am running the job over a VPN, so I am not on a fast network to the
cluster. The job runs fine for small input files - we did not run into issues
until the input file got into the multi-gigabyte range.

On Fri, Jan 20, 2012 at 11:29 AM, Raj V <[EMAIL PROTECTED]> wrote:

> Steve
>
> There seems to be something wrong with either networking or storage. Why
> does it take "hours" to generate a 4GB text file?
>
> Raj
>
>
>
> >________________________________
> > From: Steve Lewis <[EMAIL PROTECTED]>
> > To: common-user <[EMAIL PROTECTED]>; Josh Patterson <[EMAIL PROTECTED]>
> > Sent: Friday, January 20, 2012 9:16 AM
> > Subject: Problems with timeout when a Hadoop job generates a large number of key-value pairs
> >
> >We have been having problems with mappers timing out after 600 sec when
> >the mapper writes many more records - say thousands - for every input
> >record, even when the code in the mapper is small and fast. I have no
> >idea what could cause the system to be so slow and am reluctant to raise
> >the 600 sec limit without understanding why there should be a timeout
> >when all MY code is very fast.
> >
> >I am enclosing a small sample which illustrates the problem. It will
> >generate a 4GB text file on HDFS if the input file does not exist or is
> >not at least that size, and this will take some time (hours in my
> >configuration). Then the code is essentially wordcount, but instead of
> >finding and emitting words, the mapper emits all substrings of the input
> >data - this generates much more output data and many more output records
> >than wordcount does. Still, the amount of data emitted is no larger than
> >other data sets I know Hadoop can handle.
> >
> >All mappers on my 8 node cluster eventually time out after 600 sec, even
> >though I see nothing in the code which is even a little slow, and I
> >suspect that any slow behavior is in the called Hadoop code. This is
> >similar to a problem we have in bioinformatics, where a colleague saw
> >timeouts on his 50 node cluster.
> >
> >I would appreciate any help from the group. Note - if you have a text
> >file at least 4 GB, the program will take that as an input without
> >trying to create its own file.
> >/*
> >===========================================================================================
> >*/
> >import org.apache.hadoop.conf.*;
> >import org.apache.hadoop.fs.*;
> >import org.apache.hadoop.io.*;
> >import org.apache.hadoop.mapreduce.*;
> >import org.apache.hadoop.mapreduce.lib.input.*;
> >import org.apache.hadoop.mapreduce.lib.output.*;
> >import org.apache.hadoop.util.*;
> >
> >import java.io.*;
> >import java.util.*;
> >/**
> >* org.systemsbiology.hadoop.SubstringGenerator
> >*
> >* This illustrates an issue we are having where a mapper generating a
> >* much larger volume of data and number of records times out even though
> >* the code is small, simple and fast.
> >*
> >* NOTE!!! as written the program will generate a 4GB file in HDFS with
> >* good input data - this is done only if the file does not exist but may
> >* take several hours. It will only be done once. After that the failure
> >* is fairly fast.
> >*
> >* What this will do is count unique substrings of lines of length
> >* between MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH by generating all
> >* substrings and then using the word count algorithm.
> >* What is interesting is that the number and volume of writes in the
> >* map phase is MUCH larger than the number of reads and the volume of
> >* read data.
> >*
> >* The example is artificial but similar to some real bioinformatics
> >* problems - for example, finding all substrings in a genome can be
> >* important for the design of microarrays.
> >*
> >* While the real problem is more complex - the problem is that
> >* when the input file is large enough the mappers time out, failing to
> >* report after 600 sec. There is nothing slow in any of the application
> >* code and
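
The archived message breaks off above, before the mapper itself. A minimal
sketch of the mapper the text describes - emitting every substring of each
line with a length between MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH,
word-count style - might look like the following. The class name, constant
values, and the explicit context.progress() call are illustrative
assumptions, not the original code:

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

/**
 * Illustrative sketch only - not the original SubstringGenerator code.
 * Emits every substring of each input line with a length between
 * MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH, word-count style.
 */
public class SubstringMapper
        extends Mapper<LongWritable, Text, Text, IntWritable> {

    public static final int MIN_SUBSTRING_LENGTH = 5;  // assumed value
    public static final int MAX_SUBSTRING_LENGTH = 32; // assumed value

    private static final IntWritable ONE = new IntWritable(1);
    private final Text word = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        for (int start = 0; start < line.length(); start++) {
            int longest = Math.min(line.length(), start + MAX_SUBSTRING_LENGTH);
            for (int end = start + MIN_SUBSTRING_LENGTH; end <= longest; end++) {
                word.set(line.substring(start, end));
                context.write(word, ONE); // one output record per substring
            }
            // Explicitly tell the framework the task is still alive; this is
            // the usual guard against the 600 sec task timeout when a single
            // map() call produces a very large fan-out of writes.
            context.progress();
        }
    }
}
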

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
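
For reference, the 600 sec limit discussed in the thread is Hadoop's task
timeout, which can be raised (or disabled) in the job configuration - though,
as Steve says, that treats the symptom without explaining the slowness. A
minimal sketch, assuming the pre-YARN property name mapred.task.timeout
(in milliseconds) used by Hadoop of this era:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.mapreduce.Job;

public class TimeoutExample {
    public static void main(String[] args) throws Exception {
        Configuration conf = new Configuration();
        // Default is 600000 ms (10 minutes); a value of 0 disables the
        // timeout entirely, which hides problems rather than fixing them.
        conf.setLong("mapred.task.timeout", 30 * 60 * 1000L);
        Job job = new Job(conf, "substring-count");
        // ... set mapper, reducer, input and output paths as usual ...
    }
}
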