Problems with timeout when a Hadoop job generates a large number of key-value pairs
We have been having problems with mappers timing out after 600 sec when the
mapper writes many more records than it reads, say thousands of records for
every input record, even when the code in the mapper is small and fast. I
have no idea what could cause the system to be so slow, and I am reluctant to
raise the 600 sec limit without understanding why there should be a timeout
when all MY code is very fast.

I am enclosing a small sample which illustrates the problem. It will
generate a 4GB text file on hdfs if the input file does not exist or is not
at least that size, and this will take some time (hours in my configuration).
After that the code is essentially wordcount, but instead of finding and
emitting words the mapper emits all substrings of the input data. This
generates far more output data and far more output records than wordcount
does.
Still, the amount of data emitted is no larger than other data sets I know
Hadoop can handle.
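
To give a sense of the per-line expansion, here is a rough standalone sketch
(plain Java, no Hadoop, and not part of the job itself). It assumes the
constants from the sample below (100-character lines, substring lengths 5 to
32) and assumes that every substring of every allowed length is emitted once.
The class and method names are hypothetical and only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not taken from the job: counts what one input line expands to.
public class SubstringAmplification {

    /** All substrings of inp whose length lies between minLength and maxLength inclusive. */
    public static List<String> substrings(String inp, int minLength, int maxLength) {
        List<String> holder = new ArrayList<>();
        for (int start = 0; start < inp.length(); start++) {
            // stop growing the substring once it would run past the end of the line
            for (int len = minLength; len <= maxLength && start + len <= inp.length(); len++) {
                holder.add(inp.substring(start, start + len));
            }
        }
        return holder;
    }

    /** Total number of characters across all substrings (key payload only). */
    public static long totalChars(List<String> items) {
        long total = 0;
        for (String s : items) {
            total += s.length();
        }
        return total;
    }

    public static void main(String[] args) {
        // Build one 100-character line of capital letters (assumed line length).
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < 100; i++) {
            sb.append((char) ('A' + i % 26));
        }
        List<String> subs = substrings(sb.toString(), 5, 32);
        System.out.println("records per line: " + subs.size());
        System.out.println("key characters per line: " + totalChars(subs));
    }
}
```

Under those assumptions one 100-character line expands to 2310 records
carrying 40908 key characters, before counting the values or any
serialization overhead.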

All mappers on my 8 node cluster eventually time out after 600 sec, even
though I see nothing in the code which is even a little slow, and I suspect
that any slow behavior is in the Hadoop code being called. This is similar to
a problem we have in bioinformatics, where a colleague saw timeouts on his 50
node cluster.

I would appreciate any help from the group. Note - if you have a text file
of at least 4 GB the program will take that as an input without trying to
create its own file.

import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

import java.io.*;
import java.util.*;

/**
 * org.systemsbiology.hadoop.SubstringGenerator
 * This illustrates an issue we are having where a mapper generating a much
 * larger volume of data and number of records times out even though the code
 * is small, simple and fast.
 * NOTE!!! as written the program will generate a 4GB file in hdfs with good
 * input data - this is done only if the file does not exist but may take
 * several hours. It will only be done once. After that the failure is fairly
 * fast.
 * What this will do is count unique substrings of lines of length between
 * MINIMUM_LENGTH and MAXIMUM_LENGTH by generating all substrings and then
 * using the word count algorithm.
 * What is interesting is that the number and volume of writes in the map
 * phase is MUCH larger than the number of reads and the volume of read data.
 * The example is artificial but similar to some real bioinformatics
 * problems - for example finding all substrings in a genome can be important
 * for the design of microarrays.
 * While the real problem is more complex - the problem is that when the
 * input file is large enough the mappers time out, failing to report after
 * 600 sec. There is nothing slow in any of the application code and
 * nothing I
 */
public class SubstringCount  implements Tool   {
    public static final long ONE_MEG = 1024 * 1024;
    public static final long ONE_GIG = 1024 * ONE_MEG;
    public static final int LINE_LENGTH = 100;
    public static final Random RND = new Random();

    // NOTE - edit this line to be a sensible location in the current file
    public static final String INPUT_FILE_PATH = "BigInputLines.txt";
    // NOTE - edit this line to be a sensible location in the current file
    public static final String OUTPUT_FILE_PATH = "output";
    // NOTE - edit this line to be the input file size - 4 * ONE_GIG should be
    // large but not a problem
    public static final long DESIRED_LENGTH = 4 * ONE_GIG;
    // NOTE - limits on substring length
    public static final int MINIMUM_LENGTH = 5;
    public static final int MAXIMUM_LENGTH = 32;

    /**
     * create an input file to read
     *
     * @param fs !null file system
     * @param p  !null path
     * @throws IOException on error
     */
    public static void guaranteeInputFile(FileSystem fs, Path p) throws IOException {
        if (fs.isFile(p)) {
            FileStatus fileStatus = fs.getFileStatus(p);
            if (fileStatus.getLen() >= DESIRED_LENGTH)
                return; // an existing file is already large enough
        }
        FSDataOutputStream open = fs.create(p);
        PrintStream ps = new PrintStream(open);
        long showInterval = DESIRED_LENGTH / 100;
        for (long i = 0; i < DESIRED_LENGTH; i += LINE_LENGTH) {
            writeRandomLine(ps, LINE_LENGTH);
            // show progress
            if (i % showInterval == 0) {
                // assumed: the original progress statement is missing from this listing
                System.out.print(".");
            }
        }
        ps.close();
    }

    /**
     * write a line with a random string of capital letters
     *
     * @param pPs         output
     * @param pLineLength length of the line
     */
    public static void writeRandomLine(final PrintStream pPs, final int pLineLength) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pLineLength; i++) {
            char c = (char) ('A' + RND.nextInt(26));
            sb.append(c);
        }
        String s = sb.toString();
        pPs.println(s);
    }

    /**
     * Construct a Configured.
     */
    public SubstringCount() {
    }

    /**
     * Similar to the Word Count mapper but one line generates a lot more
     */
    public static class SubStringMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        /**
         * generate an array of substrings
         *
         * @param inp       input long string
         * @param minLength minimum substring length
         * @param maxLength maximum substring length
         * @return !null array of strings
         */
        public static String[] generateSubStrings(String inp, int minLength, int maxLength) {
            List<String> holder = new ArrayList<St