Jon Allen 2012-11-23, 12:02
Re: Hadoop 1.0.4 Performance Problem
Jie Li 2012-12-20, 00:27
The standalone log analyzer was released in December and is designed to
be easier to use.
Regarding the license, I think it's OK to use it in a commercial
environment for evaluation purposes, and your feedback would help us
improve it.
On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <[EMAIL PROTECTED]> wrote:
> Recent was over 11 months ago. :-)
> Unfortunately the software licence requires that most of us 'negotiate' a
> commercial use license before we trial the software in a commercial
> environment, as set out in
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
> clarified here: http://www.cs.duke.edu/starfish/previous.html
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License. Last time I asked, the reply was
> that this would not happen. Perhaps it is time to update your web pages or
> your license arrangements. :-)
> I liked what I saw on my home 'cluster' but don't have the time to sort out
> licensing to trial this in a commercial environment.
> On 14 December 2012 01:46, Jie Li <[EMAIL PROTECTED]> wrote:
>> Hi Jon,
>> Thanks for sharing these insights! I couldn't agree with you more!
>> We recently released a tool called the Starfish Hadoop Log Analyzer for
>> analyzing job histories. I believe it could quickly have pinpointed the
>> reduce problem you hit!
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <[EMAIL PROTECTED]> wrote:
>> > Jie,
>> > Simple answer - I got lucky (though obviously there are things you need
>> > to have in place to allow you to be lucky).
>> > Before running the upgrade I ran a set of tests to baseline the cluster
>> > performance, e.g. terasort, gridmix and some operational jobs. Terasort
>> > by itself isn't very realistic as a cluster test, but it's nice and
>> > simple to run and is good for regression testing things after a change.
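>> > If it helps anyone repeat this, a small driver along the following lines
>> > makes the before/after runs directly comparable (just a sketch - the row
>> > count and HDFS paths are illustrative, not our actual settings; TeraGen
>> > and TeraSort ship in the Hadoop 1.x examples jar):
>> >
>> >   import org.apache.hadoop.conf.Configuration;
>> >   import org.apache.hadoop.examples.terasort.TeraGen;
>> >   import org.apache.hadoop.examples.terasort.TeraSort;
>> >   import org.apache.hadoop.util.ToolRunner;
>> >
>> >   public class TerasortBaseline {
>> >     public static void main(String[] args) throws Exception {
>> >       Configuration conf = new Configuration();
>> >       // 10^9 rows x 100 bytes = ~100 GB of benchmark input
>> >       ToolRunner.run(conf, new TeraGen(),
>> >           new String[] {"1000000000", "/bench/terasort-in"});
>> >       long start = System.currentTimeMillis();
>> >       ToolRunner.run(conf, new TeraSort(),
>> >           new String[] {"/bench/terasort-in", "/bench/terasort-out"});
>> >       // Record wall-clock time so post-upgrade runs compare like-for-like
>> >       System.out.println("terasort took "
>> >           + (System.currentTimeMillis() - start) / 60000 + " min");
>> >     }
>> >   }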
>> > After the upgrade the intention was to run the same tests and show that
>> > the performance hadn't degraded (improved would have been nice, but not
>> > worse was the minimum). When we ran the terasort we found that
>> > performance was about 50% worse - execution time had gone from 40
>> > minutes to 60 minutes. As I've said, terasort doesn't provide a
>> > realistic view of operational performance, but this showed that
>> > something major had changed and we needed to understand it before going
>> > further. So how to go about diagnosing this ...
>> > First rule - understand what you're trying to achieve. It's very easy
>> > to say performance isn't good enough, but performance can always be
>> > better, so you need to know what's realistic and at what point you're
>> > going to stop tuning things. I had a previous baseline that I was
>> > trying to match, so I knew what I was trying to achieve.
>> > Next thing to do is profile your job and identify where the problem is.
>> > We had the full job history from the before and after jobs, and
>> > comparing these we saw that map performance was fairly consistent, as
>> > were the reduce sort and reduce phases. The problem was with the
>> > shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes
>> > afterwards. The important thing here is to make sure you've got as much
>> > information as possible. If we'd just kept the overall job time then
>> > there would have been a lot more areas to look at, but knowing the
>> > problem was with shuffle allowed me to focus effort in this area.
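>> > For anyone wanting to do the same breakdown, the per-phase timings can
>> > be pulled straight out of the job history files. A rough sketch (it
>> > assumes the Hadoop 1.x history format, where ReduceAttempt lines carry
>> > START_TIME, SHUFFLE_FINISHED, SORT_FINISHED and FINISH_TIME as
>> > KEY="value" pairs):
>> >
>> >   import java.io.BufferedReader;
>> >   import java.io.FileReader;
>> >   import java.util.HashMap;
>> >   import java.util.Map;
>> >   import java.util.regex.Matcher;
>> >   import java.util.regex.Pattern;
>> >
>> >   public class ShuffleTimes {
>> >     // Pull a numeric KEY="value" attribute out of a history line
>> >     static Long attr(String line, String key) {
>> >       Matcher m = Pattern.compile(key + "=\"(\\d+)\"").matcher(line);
>> >       return m.find() ? Long.valueOf(m.group(1)) : null;
>> >     }
>> >
>> >     public static void main(String[] args) throws Exception {
>> >       Map<String, Long> starts = new HashMap<String, Long>();
>> >       long shuffle = 0, sort = 0, reduce = 0; // ms, summed over attempts
>> >       BufferedReader in = new BufferedReader(new FileReader(args[0]));
>> >       for (String line; (line = in.readLine()) != null;) {
>> >         if (!line.startsWith("ReduceAttempt")) continue;
>> >         Matcher id =
>> >             Pattern.compile("TASK_ATTEMPT_ID=\"([^\"]+)\"").matcher(line);
>> >         if (!id.find()) continue;
>> >         Long start = attr(line, "START_TIME");
>> >         if (start != null) starts.put(id.group(1), start); // STARTED line
>> >         Long shufEnd = attr(line, "SHUFFLE_FINISHED");
>> >         Long sortEnd = attr(line, "SORT_FINISHED");
>> >         Long finish = attr(line, "FINISH_TIME");
>> >         Long begun = starts.get(id.group(1));
>> >         if (shufEnd != null && sortEnd != null && finish != null
>> >             && begun != null) {
>> >           shuffle += shufEnd - begun; // copy/shuffle phase
>> >           sort += sortEnd - shufEnd;  // sort phase
>> >           reduce += finish - sortEnd; // reduce phase
>> >         }
>> >       }
>> >       in.close();
>> >       System.out.printf("shuffle %dm, sort %dm, reduce %dm%n",
>> >           shuffle / 60000, sort / 60000, reduce / 60000);
>> >     }
>> >   }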
>> > So what had changed in the shuffle that may have slowed things down?
>> > The first thing we thought of was that we'd moved from a tarball
>> > deployment to using the RPM, so what effect might this have had on
>> > things. Our operational configuration compresses the map output and in
>> > the past we've had