Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
Well - I am running the job over a VPN, so I am not on a fast network to the
cluster.
The job runs fine for small input files - we did not run into issues until
the input file got into the multi-gigabyte range.

On Fri, Jan 20, 2012 at 11:29 AM, Raj V <[EMAIL PROTECTED]> wrote:

> Steve
>
> There seems to be something wrong with either networking or storage. Why
> does it take "hours" to generate a 4GB text file?
>
> Raj
>
>
>
> >________________________________
> >From: Steve Lewis <[EMAIL PROTECTED]>
> >To: common-user <[EMAIL PROTECTED]>; Josh Patterson <[EMAIL PROTECTED]>
> >Sent: Friday, January 20, 2012 9:16 AM
> >Subject: Problems with timeout when a Hadoop job generates a large number of key-value pairs
> >
> >We have been having problems with mappers timing out after 600 sec when the
> >mapper writes many more records, say thousands, for every input record - even
> >when the code in the mapper is small and fast. I have no idea what could
> >cause the system to be so slow and am reluctant to raise the 600 sec limit
> >without understanding why there should be a timeout when all MY code is very
> >fast.
> >
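A common way around this class of timeout is to report liveness from inside the emit loop. As a rough sketch (not the code from this thread - the class name and the emit loop are stand-ins), a mapper that writes thousands of pairs per input record can call context.progress() or bump a counter every few thousand writes so the framework sees the task is alive; the 600 sec figure is the task timeout (mapred.task.timeout, in milliseconds, on Hadoop of this vintage), which can also be raised as a last resort.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch only: demonstrates progress reporting while emitting many pairs per input line.
// The emit logic is a placeholder, not the SubstringGenerator code from this thread.
public class ChattyMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final IntWritable ONE = new IntWritable(1);
    private final Text outKey = new Text();
    private long emitted = 0;

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        for (int i = 0; i < line.length(); i++) {        // placeholder fan-out loop
            outKey.set(line.substring(i));
            context.write(outKey, ONE);
            if (++emitted % 10000 == 0) {
                // Tell the framework the task is alive so it is not killed after
                // mapred.task.timeout (600000 ms by default).
                context.progress();
                context.getCounter("debug", "pairs emitted").increment(10000);
            }
        }
    }
}

If the timeouts persist even with regular progress calls, that points at the write path (spill, sort, or the network between client and cluster) rather than the map function itself.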
> >I am enclosing a small sample which illustrates the problem. It will
> >generate a 4GB text file on HDFS if the input file does not exist or is not
> >at least that size, and this will take some time (hours in my configuration)
> >- then the code is essentially wordcount, but instead of finding and
> >emitting words, the mapper emits all substrings of the input data - this
> >generates much more output data and many more output records than wordcount
> >generates. Still, the amount of data emitted is no larger than other data
> >sets I know Hadoop can handle.
> >
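The description above is the word-count pattern with a much larger fan-out per record. A rough sketch of such a mapper follows; the bounds MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH come from the comment in the listing below, but the values used here are invented, since the listing is cut off before them.

import java.io.IOException;

import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

// Sketch: emit every substring of each input line whose length lies between
// MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH, word-count style.
public class SubstringMapper extends Mapper<LongWritable, Text, Text, IntWritable> {

    private static final int MIN_SUBSTRING_LENGTH = 5;   // assumed value
    private static final int MAX_SUBSTRING_LENGTH = 32;  // assumed value
    private static final IntWritable ONE = new IntWritable(1);
    private final Text substring = new Text();

    @Override
    protected void map(LongWritable key, Text value, Context context)
            throws IOException, InterruptedException {
        String line = value.toString();
        for (int start = 0; start < line.length(); start++) {
            int maxEnd = Math.min(line.length(), start + MAX_SUBSTRING_LENGTH);
            for (int end = start + MIN_SUBSTRING_LENGTH; end <= maxEnd; end++) {
                substring.set(line.substring(start, end));
                context.write(substring, ONE);            // far more writes than reads
            }
        }
    }
}

The reduce side would be the usual word-count sum over IntWritable values; the point is only that each input line produces on the order of line length times (MAX_SUBSTRING_LENGTH - MIN_SUBSTRING_LENGTH) output records.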
> >All mappers on my 8 node cluster eventually time out after 600 sec - even
> >though I see nothing in the code which is even a little slow, and I suspect
> >that any slow behavior is in the called Hadoop code. This is similar to a
> >problem we have in bioinformatics where a colleague saw timeouts on his 50
> >node cluster.
> >
> >I would appreciate any help from the group. Note - if you have a text file
> >of at least 4 GB, the program will take that as an input without trying to
> >create its own file.
> >/*
>
> >===========================================================================================
> >*/
> >import org.apache.hadoop.conf.*;
> >import org.apache.hadoop.fs.*;
> >import org.apache.hadoop.io.*;
> >import org.apache.hadoop.mapreduce.*;
> >import org.apache.hadoop.mapreduce.lib.input.*;
> >import org.apache.hadoop.mapreduce.lib.output.*;
> >import org.apache.hadoop.util.*;
> >
> >import java.io.*;
> >import java.util.*;
> >/**
> > * org.systemsbiology.hadoop.SubstringGenerator
> > *
> > * This illustrates an issue we are having where a mapper generating a much
> > * larger volume of data and number of records times out even though the
> > * code is small, simple and fast.
> > *
> > * NOTE!!! As written the program will generate a 4GB file in HDFS with
> > * good input data - this is done only if the file does not exist but may
> > * take several hours. It will only be done once. After that the failure is
> > * fairly fast.
> > *
> > * What this will do is count unique substrings of lines of length between
> > * MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH by generating all
> > * substrings and then using the word count algorithm.
> > * What is interesting is that the number and volume of writes in the map
> > * phase is MUCH larger than the number of reads and the volume of read data.
> > *
> > * The example is artificial but similar to some real bioinformatics
> > * problems - for example, finding all substrings in a genome can be
> > * important for the design of microarrays.
> > *
> > * While the real problem is more complex - the problem is that when the
> > * input file is large enough the mappers time out, failing to report after
> > * 600 sec. There is nothing slow in any of the application code and
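The listing breaks off here in the archive. The one-time input generation described in the comment above (write a roughly 4GB text file to HDFS only if the input path is missing or too small) could look something like the following sketch; the path handling, line contents, and target size are placeholders, not values recovered from the original code.

import java.io.IOException;
import java.io.PrintWriter;
import java.util.Random;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;

// Sketch of the one-time input generation the comment above describes:
// create a ~4GB text file of random lines on HDFS unless a large enough file already exists.
public class InputGenerator {

    private static final long TARGET_BYTES = 4L * 1024 * 1024 * 1024; // ~4GB

    public static void ensureInput(Configuration conf, Path path) throws IOException {
        FileSystem fs = path.getFileSystem(conf);
        if (fs.exists(path) && fs.getFileStatus(path).getLen() >= TARGET_BYTES) {
            return; // already generated once; reuse it
        }
        Random random = new Random(42);
        PrintWriter writer = new PrintWriter(fs.create(path, true));
        try {
            long written = 0;
            StringBuilder line = new StringBuilder();
            while (written < TARGET_BYTES) {
                line.setLength(0);
                for (int i = 0; i < 100; i++) {              // 100-character random lines
                    line.append((char) ('A' + random.nextInt(26)));
                }
                writer.println(line);
                written += line.length() + 1;
            }
        } finally {
            writer.close();
        }
    }
}

For scale: if the generator streams this data from the submitting machine over a VPN, as in the reply at the top of the thread, then at roughly 1 MB/s the 4GB write alone takes about 4,096 seconds, a bit over an hour, which would go a long way toward explaining the "hours" Raj asks about; run inside the cluster it should be far faster.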

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com