+1 the way jon elaborated it.
On Fri, Dec 21, 2012 at 6:36 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote:
> Hi Jon,
> FYI, this issue in the fair scheduler was fixed by
> https://issues.apache.org/jira/browse/MAPREDUCE-2905 for 1.1.0.
> Though it is present again in MR2:
> On Wed, Nov 28, 2012 at 2:32 PM, Jon Allen <[EMAIL PROTECTED]> wrote:
> > Jie,
> > Simple answer - I got lucky (though obviously there are things you need to
> > have in place to allow you to be lucky).
> > Before running the upgrade I ran a set of tests to baseline the cluster
> > performance, e.g. terasort, gridmix and some operational jobs. Terasort
> > itself isn't very realistic as a cluster test but it's nice and simple to
> > run and is good for regression testing things after a change.
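[Editor's sketch - the jar name, data size, and paths below are assumptions, not taken from the thread. A terasort regression run on a Hadoop 1.x cluster looks roughly like:]

```shell
# Generate benchmark input: 10^9 rows of 100 bytes (~100 GB) - size is an assumption
hadoop jar hadoop-examples.jar teragen 1000000000 /benchmarks/tera-in

# Sort it and note the wall-clock time; compare against the pre-upgrade baseline
hadoop jar hadoop-examples.jar terasort /benchmarks/tera-in /benchmarks/tera-out

# Confirm the output really is globally sorted
hadoop jar hadoop-examples.jar teravalidate /benchmarks/tera-out /benchmarks/tera-report
```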
> > After the upgrade the intention was to run the same tests and show that
> > performance hadn't degraded (improvement would have been nice, but no worse
> > was the minimum). When we ran the terasort we found that performance was
> > 50% worse - execution time had gone from 40 minutes to 60 minutes. As I
> > said, terasort doesn't provide a realistic view of operational workloads,
> > but this showed that something major had changed and we needed to understand
> > it before going further. So how to go about diagnosing this ...
> > First rule - understand what you're trying to achieve. It's very easy to
> > say performance isn't good enough but performance can always be better so
> > you need to know what's realistic and at what point you're going to stop
> > tuning things. I had a previous baseline that I was trying to match so I
> > knew what I was trying to achieve.
> > The next thing to do is profile your job and identify where the problem is. We
> > had the full job history from the before and after jobs, and comparing them
> > we saw that map performance was fairly consistent, as were the reduce sort
> > and reduce phases. The problem was with the shuffle, which had gone from 20
> > minutes pre-upgrade to 40 minutes afterwards. The important thing here is
> > to make sure you've got as much information as possible. If we'd only had
> > the overall job time then there would have been a lot more areas to look at,
> > but knowing the problem was with the shuffle allowed me to focus effort in
> > that area.
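[Editor's sketch - per-phase timings like these can be pulled from the job history on the command line in Hadoop 1.x; the output path is hypothetical:]

```shell
# Print the job summary, including per-task shuffle, sort and reduce times,
# from the history files stored alongside the job output directory
hadoop job -history /user/jon/terasort-output

# "-history all" adds full per-task details and failure information
hadoop job -history all /user/jon/terasort-output
```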
> > So what had changed in the shuffle that may have slowed things down? The
> > first thing we thought of was that we'd moved from a tarball deployment to
> > using the RPM, so what effect might this have had on things? Our
> > configuration compresses the map output, and in the past we've had problems
> > with Java compression libraries being used rather than native ones, which
> > has affected performance. We knew the RPM deployment had moved the native
> > library, so we spent some time confirming to ourselves that these were being
> > used correctly (but this turned out to not be the problem). We then spent
> > time doing some process and server profiling - using dstat to look at the
> > server bottlenecks and jstack/jmap to check what the task tracker and task
> > processes were doing. Although not directly relevant to this particular
> > problem, doing this was useful just to get my head around what Hadoop is
> > doing at various points of the process.
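[Editor's note - for reference, map-output compression in MR1 is controlled by properties like the following; this is a hypothetical mapred-site.xml fragment and the codec choice is an assumption. If the native library can't be loaded, Hadoop silently falls back to the Java implementation, which is the kind of regression being checked for above.]

```
<!-- Hypothetical mapred-site.xml fragment (MR1 property names) -->
<property>
  <name>mapred.compress.map.output</name>
  <value>true</value>
</property>
<property>
  <name>mapred.map.output.compression.codec</name>
  <value>org.apache.hadoop.io.compress.SnappyCodec</value>
</property>
```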
> > The next bit was one place where I got lucky - I happened to be logged into
> > one of the worker nodes when a test job was running and I noticed that there
> > weren't any reduce tasks running on the server. This was odd as we'd
> > submitted more reducers than we had servers, so I'd expected at least one
> > task to be running on each server. Checking the job tracker log file, it
> > turned out that since the upgrade the job tracker had been submitting
> > tasks to only 10% of the available nodes. A different 10% each time the