MapReduce >> mail # user >> Re: Hadoop 1.0.4 Performance Problem


Jon Allen 2012-11-26, 21:49
Jie Li 2012-12-14, 01:46
Chris Smith 2012-12-17, 17:02
Todd Lipcon 2012-12-21, 01:06

Re: Hadoop 1.0.4 Performance Problem
+1 to the way Jon elaborated it.
On Fri, Dec 21, 2012 at 6:36 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote:

> Hi Jon,
>
> FYI, this issue in the fair scheduler was fixed by
> https://issues.apache.org/jira/browse/MAPREDUCE-2905 for 1.1.0.
> Though it is present again in MR2:
> https://issues.apache.org/jira/browse/MAPREDUCE-3268
>
> -Todd
>
> On Wed, Nov 28, 2012 at 2:32 PM, Jon Allen <[EMAIL PROTECTED]> wrote:
> > Jie,
> >
> > Simple answer - I got lucky (though obviously there are things you need to
> > have in place to allow you to be lucky).
> >
> > Before running the upgrade I ran a set of tests to baseline the cluster
> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort by
> > itself isn't very realistic as a cluster test but it's nice and simple to
> > run and is good for regression testing things after a change.
> >
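[For reference, a baseline of the kind Jon describes can be driven with the examples jar shipped in Hadoop 1.0.4. The jar path, row count and HDFS directories below are illustrative assumptions, not details from this thread.]

    # Generate ~1TB of input (100-byte rows), sort it, then validate the sorted output
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar teragen 10000000000 /benchmarks/tera-in
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar terasort /benchmarks/tera-in /benchmarks/tera-out
    hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar teravalidate /benchmarks/tera-out /benchmarks/tera-report
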
> > After the upgrade the intention was to run the same tests and show that
> > the performance hadn't degraded (improved would have been nice but not
> > worse was the minimum).  When we ran the terasort we found that
> > performance was about 50% worse - execution time had gone from 40 minutes
> > to 60 minutes.  As I've said, terasort doesn't provide a realistic view
> > of operational performance but this showed that something major had
> > changed and we needed to understand it before going further.  So how to
> > go about diagnosing this ...
> >
> > First rule - understand what you're trying to achieve.  It's very easy to
> > say performance isn't good enough but performance can always be better so
> > you need to know what's realistic and at what point you're going to stop
> > tuning things.  I had a previous baseline that I was trying to match so I
> > knew what I was trying to achieve.
> >
> > Next thing to do is profile your job and identify where the problem is.
> > We had the full job history from the before and after jobs and comparing
> > these we saw that map performance was fairly consistent as were the
> > reduce sort and reduce phases.  The problem was with the shuffle, which
> > had gone from 20 minutes pre-upgrade to 40 minutes afterwards.  The
> > important thing here is to make sure you've got as much information as
> > possible.  If we'd just kept the overall job time then there would have
> > been a lot more areas to look at but knowing the problem was with shuffle
> > allowed me to focus effort in this area.
> >
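[As a sketch of how to get that per-phase breakdown on Hadoop 1.x: the job client can replay the stored job history, including shuffle and sort timings per task attempt. The output directory below is a placeholder.]

    # Print task-level details and timing analysis from the history kept under <outdir>/_logs/history
    hadoop job -history all /benchmarks/tera-out
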
> > So what had changed in the shuffle that may have slowed things down.
> > The first thing we thought of was that we'd moved from a tarball
> > deployment to using the RPM so what effect might this have had on
> > things.  Our operational configuration compresses the map output and in
> > the past we've had problems with Java compression libraries being used
> > rather than native ones and this has affected performance.  We knew the
> > RPM deployment had moved the native library so spent some time
> > confirming to ourselves that these were being used correctly (but this
> > turned out to not be the problem).  We then spent time doing some
> > process and server profiling - using dstat to look at the server
> > bottlenecks and jstack/jmap to check what the task tracker and reduce
> > processes were doing.  Although not directly relevant to this particular
> > problem doing this was useful just to get my head around what Hadoop is
> > doing at various points of the process.
> >
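[Some rough ways to make the same checks, with made-up paths and PIDs; the config property names are the Hadoop 1.x ones.]

    # Is map-output compression on, and which codec is configured?
    grep -A1 -E 'mapred.compress.map.output|mapred.map.output.compression.codec' $HADOOP_HOME/conf/mapred-site.xml

    # Did the TaskTracker pick up the native libhadoop (and so native compression)?
    grep -i 'native-hadoop library' $HADOOP_HOME/logs/hadoop-*-tasktracker-*.log

    # Server-level view during a test run: CPU, disk, network and memory every 5 seconds
    dstat -cdnm 5

    # Thread and heap snapshots of a TaskTracker or reduce task JVM (12345 is a placeholder PID)
    jstack 12345
    jmap -histo 12345
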
> > The next bit was one place where I got lucky - I happened to be logged
> > onto one of the worker nodes when a test job was running and I noticed
> > that there weren't any reduce tasks running on the server.  This was odd
> > as we'd submitted more reducers than we have servers so I'd expected at
> > least one task to be running on each server.  Checking the job tracker
> > log file it turned out that since the upgrade the job tracker had been
> > submitting reduce tasks to only 10% of the available nodes.  A different
> > 10% each time the
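[A less lucky way to spot the same symptom is to count reduce-task assignments per tracker in the JobTracker log; the exact wording of the 1.x log message is quoted from memory and should be treated as an assumption.]

    # How many reduce attempts did each task tracker receive?
    grep "Adding task (REDUCE)" $HADOOP_HOME/logs/hadoop-*-jobtracker-*.log \
      | grep -o "tracker_[^']*" | sort | uniq -c | sort -rn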