Hadoop >> mail # user >> Problems with timeout when a Hadoop job generates a large number of key-value pairs

Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
Every so often, you should call context.progress() so that the
framework knows that this map is doing useful work. That will prevent
the framework from killing it after 10 minutes. The framework does this
automatically every time you call context.write() or
context.setStatus(), but if the map is stuck for 10 minutes while
processing some keys (perhaps in generateSubStrings()), that can lead
to a timeout.
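The pattern the reply describes can be sketched in plain Java; the `Context` interface below is a stub standing in for Hadoop's `Mapper.Context` (an assumption for illustration, so the sketch runs without Hadoop on the classpath):

```java
import java.util.Arrays;
import java.util.List;

/**
 * Minimal sketch of the advice above: report progress every N emitted
 * records so the framework does not assume the task is hung.
 * Context is a stub standing in for Hadoop's Mapper.Context.
 */
public class ProgressSketch {
    interface Context { void progress(); }

    /** Emits each record, calling ctx.progress() every reportEvery writes;
     *  returns the number of progress calls made. */
    static int runMap(List<String> records, Context ctx, int reportEvery) {
        int emitted = 0, calls = 0;
        for (String s : records) {
            // in the real mapper: context.write(new Text(s), ONE);
            emitted++;
            if (emitted % reportEvery == 0) {
                ctx.progress(); // tell the framework this task is alive
                calls++;
            }
        }
        return calls;
    }

    public static void main(String[] args) {
        List<String> fake = Arrays.asList("a", "b", "c", "d", "e", "f", "g");
        // 7 records, reporting every 3rd write: progress fires at 3 and 6
        System.out.println(runMap(fake, () -> {}, 3)); // prints 2
    }
}
```

In a real mapper the same `if (emitted % N == 0) context.progress();` line goes inside the loop that emits key-value pairs, with N chosen so progress is reported at least every few minutes.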

On Fri, Jan 20, 2012 at 9:16 AM, Steve Lewis <[EMAIL PROTECTED]> wrote:
> We have been having problems with mappers timing out after 600 sec when the
> mapper writes many more records - say thousands - for every input record,
> even when the code in the mapper is small and fast. I have no idea what
> could cause the system to be so slow, and I am reluctant to raise the
> 600 sec limit without understanding why there should be a timeout when
> all MY code is very fast.
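[For reference: the 600 sec limit is the task timeout, which in Hadoop releases of this era is controlled by the mapred.task.timeout property (in milliseconds, default 600000). It can be raised in mapred-site.xml, though as the mail says, understanding the cause is preferable. Property name assumed from Hadoop 0.20/1.x; newer releases use mapreduce.task.timeout.]

```xml
<!-- mapred-site.xml: raise the task timeout from the default 600000 ms -->
<property>
  <name>mapred.task.timeout</name>
  <value>1200000</value> <!-- 20 minutes -->
</property>
```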
> I am enclosing a small sample which illustrates the problem. It will
> generate a 4 GB text file on HDFS if the input file does not exist or is
> not at least that size, and this will take some time (hours in my
> configuration). After that, the code is essentially WordCount, but instead
> of finding and emitting words, the mapper emits all substrings of the
> input data - this generates much more output data and many more output
> records than WordCount does.
> Still, the amount of data emitted is no larger than other data sets I know
> Hadoop can handle.
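To make the write blowup concrete, here is a plausible shape for the substring generation described above. This is a hypothetical reconstruction, not the author's actual implementation (which appears in the attached code); it shows why map output records vastly outnumber input records:

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical sketch of the substring step the mail describes: emit
 *  every substring of a line with length in [minLen, maxLen]. */
public class SubstringSketch {
    static List<String> generateSubStrings(String line, int minLen, int maxLen) {
        List<String> out = new ArrayList<>();
        for (int start = 0; start < line.length(); start++) {
            for (int len = minLen;
                 len <= maxLen && start + len <= line.length();
                 len++) {
                out.add(line.substring(start, start + len));
            }
        }
        return out;
    }

    public static void main(String[] args) {
        // Even a tiny input line fans out into several records; a
        // 100-char line with lengths 1..20 yields ~1800 substrings.
        System.out.println(generateSubStrings("abc", 1, 2)); // [a, ab, b, bc, c]
    }
}
```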
> All mappers on my 8-node cluster eventually time out after 600 sec, even
> though I see nothing in the code which is even a little slow, and I
> suspect that any slow behavior is in the called Hadoop code. This is
> similar to a problem we have in bioinformatics, where a colleague saw
> timeouts on his 50-node cluster.
> I would appreciate any help from the group. Note: if you have a text file
> of at least 4 GB, the program will take that as its input without trying
> to create its own file.
> /* ======================================================================= */
> import org.apache.hadoop.conf.*;
> import org.apache.hadoop.fs.*;
> import org.apache.hadoop.io.*;
> import org.apache.hadoop.mapreduce.*;
> import org.apache.hadoop.mapreduce.lib.input.*;
> import org.apache.hadoop.mapreduce.lib.output.*;
> import org.apache.hadoop.util.*;
> import java.io.*;
> import java.util.*;
> /**
>  * org.systemsbiology.hadoop.SubstringGenerator
>  *
>  * This illustrates an issue we are having where a mapper generating a
>  * much larger volume of data and number of records times out, even
>  * though the code is small, simple and fast.
>  *
>  * NOTE!!! As written, the program will generate a 4GB file in HDFS with
>  * good input data - this is done only if the file does not exist, but
>  * may take several hours. It is only done once; after that the failure
>  * is fairly fast.
>  *
>  * What this will do is count unique substrings of lines, of length
>  * between MIN_SUBSTRING_LENGTH and MAX_SUBSTRING_LENGTH, by generating
>  * all substrings and then using the word count algorithm.
>  * What is interesting is that the number and volume of writes in the
>  * map phase is MUCH larger than the number of reads and the volume of
>  * read data.
>  *
>  * The example is artificial but similar to some real bioinformatics
>  * problems - for example, finding all substrings in a genome can be
>  * important for the design of microarrays.
>  *
>  * While the real problem is more complex, the issue is that when the
>  * input file is large enough the mappers time out, failing to report
>  * after 600 sec. There is nothing slow in any of the application code
>  * and nothing I can point to as the cause.
>  */
> public class SubstringCount  implements Tool   {
>    public static final long ONE_MEG = 1024 * 1024;
>    public static final long ONE_GIG = 1024 * ONE_MEG;
>    public static final int LINE_LENGTH = 100;
>    public static final Random RND = new Random();
>   // NOTE - edit this line to be a sensible location in the current file