Re: Problems with timeout when a Hadoop job generates a large number of key-value pairs
I have been silent for a few days because on my cluster I was UNABLE to
reproduce the issue.
What I do see is that the merge is taking a HUGE amount of time.

In my hands the mapper reaches 100% and then enters a silent phase running
the combiner and other merge operations. Is it your experience that the timeout
occurs in this phase?

I also find that the numbers of combiner input and output records are very
similar. In the problem I gave, you would expect relatively few duplicates
among random substrings; in another thread I want to discuss the issue of
how much duplication is needed to justify a combiner.

I will try your tuning suggestions - in the real problem I use a custom
input format and, as you suggest, use more mappers than the standard.

The example is contrived, but the real code does something very similar with
bioinformatic data, and I have other samples where a mapper will generate a
lot more data than it reads; it is important to understand how to tune
for this case.

Thanks for your help
On Sun, Jan 22, 2012 at 1:03 PM, Alex Kozlov <[EMAIL PROTECTED]> wrote:

> Hi Steve, I think I was able to reproduce your problem over the weekend
> (not sure though, it may be a different problem).  In my case it was that
> the mappers were timing out during the merge phase.  I also think the
> related tickets are
> MAPREDUCE-2177 <https://issues.apache.org/jira/browse/MAPREDUCE-2177> and
> MAPREDUCE-2187 <https://issues.apache.org/jira/browse/MAPREDUCE-2187>.  In
> my case I oversubscribed the cluster a bit with respect to the # of
> map/reduce slots.
>
> In general, this is quite an unusual workload, as every byte in the original
> dataset generates 100x of output very fast.  This workload requires some
> special tuning (a configuration sketch follows this list):
>
> - undersubscribe the nodes with respect to the # of mappers/reducers (say, use
>   only 1/2 of the # of spindles for each *mapred.tasktracker.map.tasks.maximum*
>   and *mapred.tasktracker.reduce.tasks.maximum*)
> - increase *mapred.reduce.slowstart.completed.maps* (set to ~0.95 so that
>   reducers do not interfere with working mappers)
> - reduce *mapred.merge.recordsBeforeProgress* (set to 100; the default is 10000)
> - reduce *mapred.combine.recordsBeforeProgress* (set to 100; the default is 10000)
> - decrease *dfs.block.size* for the input file so that each mapper handles less data
> - increase the # of reducers so that each reducer handles less data
> - increase *io.sort.mb* and child memory to decrease the # of spills
>
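> For illustration, a minimal sketch of how the per-job settings above might be
> set in driver code.  The concrete values are illustrative only and the class
> name is made up; the two *mapred.tasktracker.\*.tasks.maximum* slot limits are
> per-node TaskTracker settings that go in mapred-site.xml rather than job code.
>
>     import org.apache.hadoop.conf.Configuration;
>     import org.apache.hadoop.mapreduce.Job;
>
>     public class SubstringJobDriver {
>         public static void main(String[] args) throws Exception {
>             Configuration conf = new Configuration();
>
>             // let reducers start only after ~95% of the mappers have finished
>             conf.setFloat("mapred.reduce.slowstart.completed.maps", 0.95f);
>
>             // report progress more often during merge/combine phases
>             conf.setInt("mapred.merge.recordsBeforeProgress", 100);
>             conf.setInt("mapred.combine.recordsBeforeProgress", 100);
>
>             // dfs.block.size takes effect when the input file is (re)written
>             conf.setLong("dfs.block.size", 64L * 1024 * 1024);  // illustrative value
>
>             // more reducers so each reducer handles less data
>             conf.setInt("mapred.reduce.tasks", 40);              // illustrative value
>
>             // bigger sort buffer and child heap to cut down on spills
>             conf.setInt("io.sort.mb", 256);
>             conf.set("mapred.child.java.opts", "-Xmx1024m");
>
>             Job job = new Job(conf, "substring generation");
>             // set mapper/reducer/combiner, input and output paths as usual,
>             // then job.waitForCompletion(true)
>         }
>     }
>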
> Hope this helps.  Let me know.
>
> --
> Alex K
> <http://www.cloudera.com/company/press-center/hadoop-world-nyc/>
>
> On Fri, Jan 20, 2012 at 2:23 PM, Steve Lewis <[EMAIL PROTECTED]>
> wrote:
>
> > Interesting - I strongly suspect a disk IO or network problem since my code
> > is very simple and very fast.
> > If you add lines to generateSubStrings to limit String length to 100
> > characters (I think it is always that, but this makes sure):
> >
> > public static String[] generateSubStrings(String inp, int minLength, int maxLength) {
> >     // guarantee no more than 100 characters
> >     if (inp.length() > 100)
> >         inp = inp.substring(0, 100);
> >     List<String> holder = new ArrayList<String>();
> >     for (int start = 0; start < inp.length() - minLength; start++) {
> >         for (int end = start + minLength; end < Math.min(inp.length(), start + maxLength); end++) {
> >             try {
> >                 holder.add(inp.substring(start, end));
> >             }
> >             catch (Exception e) {
> >                 throw new RuntimeException(e);
> >             }
> >         }
> >     }
> >     // return the collected substrings
> >     return holder.toArray(new String[holder.size()]);
> > }
> >
> > On Fri, Jan 20, 2012 at 12:41 PM, Alex Kozlov <[EMAIL PROTECTED]> wrote:
> >
> > > Hi Steve, I ran your job on our cluster and it does not time out.  I noticed
> > > that each mapper runs for a long time: one way to avoid a timeout is to
> > > update a user counter.  As long as this counter is updated within 10

Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com