Problems with timeout when a Hadoop job generates a large number of key-value pairs
We have been having problems with mappers timing out after 600 sec when the
mapper writes many more records than it reads - say thousands of output
records for every input record - even when the code in the mapper is small
and fast. I have no idea what could cause the system to be so slow, and I am
reluctant to raise the 600 sec limit without understanding why there should
be a timeout when all MY code is very fast.
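For reference, the 600 sec limit is the mapred.task.timeout property,
specified in milliseconds (default 600000). If raising it did turn out to be
the right fix, a per-job sketch would be:

    // sketch only: raise the task timeout to 30 minutes for this job
    Configuration conf = new Configuration();
    conf.setLong("mapred.task.timeout", 30L * 60 * 1000);  // default 600000 = 600 sec
    Job job = new Job(conf, "substring count");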

I am enclosing a small sample which illustrates the problem. If the input
file does not exist or is smaller than 4 GB, it will generate a 4 GB text
file on HDFS, which takes some time (hours in my configuration). After that
the code is essentially WordCount, except that instead of finding and
emitting words, the mapper emits all substrings of the input data. This
generates far more output data and far more output records than WordCount
does. Still, the amount of data emitted is no larger than other data sets I
know Hadoop can handle.
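To put rough numbers on the expansion, using the constants in the enclosed
code (100-character lines, substring lengths 5 to 32):

    substrings per line   = sum over len = 5..32 of (100 - len + 1) = 2310
    average substring len = (5 + 32) / 2 = 18.5 characters
    map output per line   = roughly 2310 * 18.5 = about 43 KB

So each 100-byte input line yields a few thousand records and a roughly
400x expansion, and the 4 GB input produces on the order of 1.7 TB of raw
map output ahead of the shuffle.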

All mappers on my 8 node cluster eventually time out after 600 sec, even
though I see nothing in the code which is even a little slow, and I suspect
that any slow behavior is in the Hadoop code being called. This is similar
to a problem we have in bioinformatics, where a colleague saw the same
timeouts on his 50 node cluster.
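One standard way to keep a long-running map() alive is to report progress
explicitly while emitting, so the framework knows the task is not hung. A
minimal sketch of that pattern (not part of the enclosed sample, which does
no explicit reporting):

    // sketch only: report progress every 1000 output records inside map()
    public void map(Object key, Text value, Context context)
            throws IOException, InterruptedException {
        String[] subs = generateSubStrings(value.toString(), MINIMUM_LENGTH, MAXIMUM_LENGTH);
        for (int i = 0; i < subs.length; i++) {
            context.write(new Text(subs[i]), new IntWritable(1));
            if (i % 1000 == 0)
                context.progress();  // resets the 600 sec no-report clock
        }
    }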

I would appreciate any help from the group. Note - if you already have a
text file of at least 4 GB, the program will take that as an input without
trying to create its own file.
/* =========================================================================================== */
import org.apache.hadoop.conf.*;
import org.apache.hadoop.fs.*;
import org.apache.hadoop.io.*;
import org.apache.hadoop.mapreduce.*;
import org.apache.hadoop.mapreduce.lib.input.*;
import org.apache.hadoop.mapreduce.lib.output.*;
import org.apache.hadoop.util.*;

import java.io.*;
import java.util.*;
/**
 * org.systemsbiology.hadoop.SubstringGenerator
 *
 * This illustrates an issue we are having where a mapper generating a much
 * larger volume of data and number of records times out even though the
 * code is small, simple and fast.
 *
 * NOTE!!! as written the program will generate a 4GB file in HDFS with
 * good input data - this is done only if the file does not exist, but it
 * may take several hours. It will only be done once. After that the
 * failure is fairly fast.
 *
 * What this will do is count unique substrings of lines with lengths
 * between MINIMUM_LENGTH and MAXIMUM_LENGTH by generating all substrings
 * and then using the word count algorithm.
 * What is interesting is that the number and volume of writes in the map
 * phase is MUCH larger than the number of reads and the volume of read
 * data.
 *
 * The example is artificial but similar to some real bioinformatics
 * problems - for example, finding all substrings in a genome can be
 * important for the design of microarrays.
 *
 * While the real problem is more complex, the issue is that when the
 * input file is large enough the mappers time out, failing to report
 * after 600 sec. There is nothing slow in any of the application code and
 * nothing I can see that should take anywhere near that long.
 */
public class SubstringCount implements Tool {
    public static final long ONE_MEG = 1024 * 1024;
    public static final long ONE_GIG = 1024 * ONE_MEG;
    public static final int LINE_LENGTH = 100;
    public static final Random RND = new Random();

    // NOTE - edit this line to be a sensible location in the current file system
    public static final String INPUT_FILE_PATH = "BigInputLines.txt";
    // NOTE - edit this line to be a sensible location in the current file system
    public static final String OUTPUT_FILE_PATH = "output";
    // NOTE - edit this line to be the input file size - 4 * ONE_GIG should be large but not a problem
    public static final long DESIRED_LENGTH = 4 * ONE_GIG;
    // NOTE - limits on substring length
    public static final int MINIMUM_LENGTH = 5;
    public static final int MAXIMUM_LENGTH = 32;
    /**
     * create an input file to read
     *
     * @param fs !null file system
     * @param p  !null path
     * @throws IOException on error
     */
    public static void guaranteeInputFile(FileSystem fs, Path p) throws IOException {
        if (fs.isFile(p)) {
            FileStatus fileStatus = fs.getFileStatus(p);
            if (fileStatus.getLen() >= DESIRED_LENGTH)
                return;    // a big enough file already exists - nothing to do
        }
        FSDataOutputStream open = fs.create(p);
        PrintStream ps = new PrintStream(open);
        // print a dot roughly every 1% of the file; rounding the interval to a
        // multiple of LINE_LENGTH keeps i % showInterval == 0 reachable
        long showInterval = (DESIRED_LENGTH / 100 / LINE_LENGTH) * LINE_LENGTH;
        for (long i = 0; i < DESIRED_LENGTH; i += LINE_LENGTH) {
            writeRandomLine(ps, LINE_LENGTH);
            if (i % showInterval == 0) {
                System.err.print(".");    // show progress
            }
        }
        System.err.println("");
        ps.close();
    }

    /**
     * write a line with a random string of capital letters
     *
     * @param pPs         output stream
     * @param pLineLength length of the line
     */
    public static void writeRandomLine(final PrintStream pPs, final int pLineLength) {
        StringBuilder sb = new StringBuilder();
        for (int i = 0; i < pLineLength; i++) {
            char c = (char) ('A' + RND.nextInt(26));    // random capital letter
            sb.append(c);
        }
        String s = sb.toString();
        pPs.println(s);
    }

    /**
     * default constructor
     */
    public SubstringCount() {
    }

    /**
     * similar to the WordCount mapper but one input line generates a lot more output
     */
    public static class SubStringMapper
            extends Mapper<Object, Text, Text, IntWritable> {

        /**
         * generate an array of substrings
         *
         * @param inp       input long string
         * @param minLength minimum substring length
         * @param maxLength maximum substring length
         * @return !null array of strings
         */
        public static String[] generateSubStrings(String inp, int minLength, int maxLength) {
            List<String> holder = new ArrayList<String>();
            // every substring with length between minLength and maxLength
            for (int start = 0; start <= inp.length() - minLength; start++) {
                for (int len = minLength; len <= maxLength && start + len <= inp.length(); len++) {
                    holder.add(inp.substring(start, start + len));
                }
            }
            return holder.toArray(new String[holder.size()]);
        }
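The rest of the class is not shown above; per the description ("essentially
wordcount"), a reducer and driver would look roughly like this sketch
(illustrative only, not the original code):

    public static class IntSumReducer
            extends Reducer<Text, IntWritable, Text, IntWritable> {
        public void reduce(Text key, Iterable<IntWritable> values, Context context)
                throws IOException, InterruptedException {
            int sum = 0;
            for (IntWritable v : values)
                sum += v.get();                          // classic word-count sum
            context.write(key, new IntWritable(sum));
        }
    }

    public int run(String[] args) throws Exception {
        Configuration conf = new Configuration();
        Job job = new Job(conf, "substring count");      // Hadoop 1.x style job setup
        job.setJarByClass(SubstringCount.class);
        job.setMapperClass(SubStringMapper.class);
        job.setCombinerClass(IntSumReducer.class);       // a combiner cuts shuffle volume sharply here
        job.setReducerClass(IntSumReducer.class);
        job.setOutputKeyClass(Text.class);
        job.setOutputValueClass(IntWritable.class);
        FileSystem fs = FileSystem.get(conf);
        Path input = new Path(INPUT_FILE_PATH);
        guaranteeInputFile(fs, input);                   // build the 4 GB input if needed
        FileInputFormat.addInputPath(job, input);
        FileOutputFormat.setOutputPath(job, new Path(OUTPUT_FILE_PATH));
        return job.waitForCompletion(true) ? 0 : 1;
    }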