Jon Allen 2012-11-26, 21:49
Jie Li 2012-12-14, 01:46
Chris Smith 2012-12-17, 17:02
Todd Lipcon 2012-12-21, 01:06
Re: Hadoop 1.0.4 Performance Problem
anand sharma 2012-12-21, 01:21
+1 to the way Jon elaborated it.

On Fri, Dec 21, 2012 at 6:36 AM, Todd Lipcon <[EMAIL PROTECTED]> wrote:

> Hi Jon,
>
> FYI, this issue in the fair scheduler was fixed by
> https://issues.apache.org/jira/browse/MAPREDUCE-2905 for 1.1.0.
> Though it is present again in MR2:
> https://issues.apache.org/jira/browse/MAPREDUCE-3268
>
> -Todd
>
> On Wed, Nov 28, 2012 at 2:32 PM, Jon Allen <[EMAIL PROTECTED]> wrote:
> > Jie,
> >
> > Simple answer - I got lucky (though obviously there are things you
> > need to have in place to allow you to be lucky).
> >
> > Before running the upgrade I ran a set of tests to baseline the
> > cluster performance, e.g. terasort, gridmix and some operational
> > jobs. Terasort by itself isn't very realistic as a cluster test but
> > it's nice and simple to run and is good for regression testing
> > things after a change.
> >
> > After the upgrade the intention was to run the same tests and show
> > that the performance hadn't degraded (improved would have been nice
> > but not worse was the minimum). When we ran the terasort we found
> > that performance was about 50% worse - execution time had gone from
> > 40 minutes to 60 minutes. As I've said, terasort doesn't provide a
> > realistic view of operational performance but this showed that
> > something major had changed and we needed to understand it before
> > going further. So how to go about diagnosing this ...
> >
> > First rule - understand what you're trying to achieve. It's very
> > easy to say performance isn't good enough but performance can always
> > be better so you need to know what's realistic and at what point
> > you're going to stop tuning things. I had a previous baseline that I
> > was trying to match so I knew what I was trying to achieve.
> >
> > Next thing to do is profile your job and identify where the problem
> > is. We had the full job history from the before and after jobs and
> > comparing these we saw that map performance was fairly consistent as
> > were the reduce sort and reduce phases. The problem was with the
> > shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes
> > afterwards. The important thing here is to make sure you've got as
> > much information as possible. If we'd just kept the overall job time
> > then there would have been a lot more areas to look at but knowing
> > the problem was with shuffle allowed me to focus effort in this
> > area.
> >
> > So what had changed in the shuffle that may have slowed things down?
> > The first thing we thought of was that we'd moved from a tarball
> > deployment to using the RPM so what effect might this have had on
> > things? Our operational configuration compresses the map output and
> > in the past we've had problems with Java compression libraries being
> > used rather than native ones and this has affected performance. We
> > knew the RPM deployment had moved the native library so spent some
> > time confirming to ourselves that these were being used correctly
> > (but this turned out to not be the problem). We then spent time
> > doing some process and server profiling - using dstat to look at the
> > server bottlenecks and jstack/jmap to check what the task tracker
> > and reduce processes were doing. Although not directly relevant to
> > this particular problem doing this was useful just to get my head
> > around what Hadoop is doing at various points of the process.
> >
> > The next bit was one place where I got lucky - I happened to be
> > logged onto one of the worker nodes when a test job was running and
> > I noticed that there weren't any reduce tasks running on the server.
> > This was odd as we'd submitted more reducers than we have servers so
> > I'd expected at least one task to be running on each server.
> > Checking the job tracker log file it turned out that since the
> > upgrade the job tracker had been submitting reduce tasks to only 10%
> > of the available nodes. A different 10% each time the