Hadoop, mail # user - Hadoop 1.0.4 Performance Problem


Jon Allen 2012-11-23, 12:02
Re: Hadoop 1.0.4 Performance Problem
Jie Li 2012-12-20, 00:27
Hi Chris,

The standalone log analyzer was released in December and designed to
be easier to use.

Regarding the license, I think it's OK to use it in a commercial
environment for evaluation purposes, and your feedback would help
us improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <[EMAIL PROTECTED]> wrote:
> Jie,
>
> Recent was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate' a
> commercial use license before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt and as
> clarified here:  http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the source
> code under the Apache Software License.  Last time I asked the reply was
> that this would not happen.  Perhaps it is time to update your web pages or
> your license arrangements.  :-)
>
> I like what I saw on my home 'cluster' but haven't had the time to sort out
> licensing to trial this in a commercial environment.
>
> Chris
>
> On 14 December 2012 01:46, Jie Li <[EMAIL PROTECTED]> wrote:
>>
>> Hi Jon,
>>
>> Thanks for sharing these insights! Can't agree with you more!
>>
>> Recently we released a tool called Starfish Hadoop Log Analyzer for
>> analyzing job histories. I believe it can quickly point out the reduce
>> problem you ran into!
>>
>> http://www.cs.duke.edu/starfish/release.html
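>>
>> To give a feel for the kind of analysis it does, here is a rough sketch
>> in Python that pulls reduce-phase timings straight out of a 1.x job
>> history file. The field names and the one-record-per-line layout are
>> assumptions about the history format, so treat it as a starting point
>> rather than the tool itself:
>>
>>   #!/usr/bin/env python
>>   # Sketch only: pull reduce-side phase timings out of a Hadoop 1.x job
>>   # history file.  The field names (START_TIME, SHUFFLE_FINISHED,
>>   # SORT_FINISHED, FINISH_TIME on ReduceAttempt records) and the
>>   # one-record-per-line layout are assumptions - adjust to your files.
>>   import re
>>   import sys
>>
>>   FIELD = re.compile(r'(\w+)="([^"]*)"')
>>
>>   starts = {}                          # attempt id -> launch time (ms)
>>   phases = {"shuffle": [], "sort": [], "reduce": []}
>>
>>   for line in open(sys.argv[1]):
>>       if not line.startswith("ReduceAttempt"):
>>           continue
>>       kv = dict(FIELD.findall(line))
>>       aid = kv.get("TASK_ATTEMPT_ID")
>>       if "START_TIME" in kv:
>>           starts[aid] = int(kv["START_TIME"])
>>       if kv.get("TASK_STATUS") == "SUCCESS" and aid in starts:
>>           shuffle_end = int(kv["SHUFFLE_FINISHED"])
>>           sort_end = int(kv["SORT_FINISHED"])
>>           finish = int(kv["FINISH_TIME"])
>>           phases["shuffle"].append(shuffle_end - starts[aid])
>>           phases["sort"].append(sort_end - shuffle_end)
>>           phases["reduce"].append(finish - sort_end)
>>
>>   for name in ("shuffle", "sort", "reduce"):
>>       times = phases[name]
>>       if times:
>>           print("%s: avg %.1fs  max %.1fs over %d attempts" %
>>                 (name, sum(times) / 1000.0 / len(times),
>>                  max(times) / 1000.0, len(times)))
>>
>> Comparing those numbers for the pre- and post-upgrade runs is exactly the
>> kind of thing that makes a shuffle regression jump out.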
>>
>> Jie
>>
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <[EMAIL PROTECTED]> wrote:
>> > Jie,
>> >
>> > Simple answer - I got lucky (though obviously there are things you need
>> > to have in place to allow you to be lucky).
>> >
>> > Before running the upgrade I ran a set of tests to baseline the cluster
>> > performance, e.g. terasort, gridmix and some operational jobs.  Terasort
>> > by itself isn't very realistic as a cluster test but it's nice and simple
>> > to run and is good for regression testing things after a change.
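>> >
>> > For reference, the terasort baseline was nothing exotic - just the stock
>> > examples jar along these lines, where the jar path, row count and HDFS
>> > paths are placeholders for whatever your install uses rather than our
>> > real settings:
>> >
>> >   # roughly 1 TB of input: teragen writes 100-byte rows
>> >   hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar teragen \
>> >       10000000000 /benchmarks/terasort-input
>> >   hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar terasort \
>> >       /benchmarks/terasort-input /benchmarks/terasort-output
>> >   hadoop jar $HADOOP_HOME/hadoop-examples-1.0.4.jar teravalidate \
>> >       /benchmarks/terasort-output /benchmarks/terasort-validate
>> >
>> > Keeping the generated input around means the sort can be re-run on
>> > identical data before and after a change.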
>> >
>> > After the upgrade the intention was to run the same tests and show that
>> > the performance hadn't degraded (improved would have been nice but not
>> > worse was the minimum).  When we ran the terasort we found that
>> > performance was about 50% worse - execution time had gone from 40
>> > minutes to 60 minutes.  As I've said, terasort doesn't provide a
>> > realistic view of operational performance but this showed that something
>> > major had changed and we needed to understand it before going further.
>> > So how to go about diagnosing this ...
>> >
>> > First rule - understand what you're trying to achieve.  It's very easy
>> > to say performance isn't good enough but performance can always be
>> > better so you need to know what's realistic and at what point you're
>> > going to stop tuning things.  I had a previous baseline that I was
>> > trying to match so I knew what I was trying to achieve.
>> >
>> > Next thing to do is profile your job and identify where the problem is.
>> > We had the full job history from the before and after jobs and comparing
>> > these we saw that map performance was fairly consistent as were the
>> > reduce sort and reduce phases.  The problem was with the shuffle, which
>> > had gone from 20 minutes pre-upgrade to 40 minutes afterwards.  The
>> > important thing here is to make sure you've got as much information as
>> > possible.  If we'd just kept the overall job time then there would have
>> > been a lot more areas to look at but knowing the problem was with
>> > shuffle allowed me to focus effort in this area.
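>> >
>> > Even if you don't keep your own copies, that breakdown can be pulled
>> > back out of the history afterwards - on 1.x something along these lines
>> > should do it, assuming the job's output directory still has its
>> > _logs/history files (the path below is just a placeholder):
>> >
>> >   # job summary: per-phase averages, worst-performing tasks, etc.
>> >   hadoop job -history /path/to/job/output/dir
>> >   # full per-task-attempt detail
>> >   hadoop job -history all /path/to/job/output/dir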
>> >
>> > So what had changed in the shuffle that may have slowed things down.
>> > The first thing we thought of was that we'd moved from a tarball
>> > deployment to using the RPM so what effect might this have had on
>> > things.  Our operational configuration compresses the map output and in
>> > the past we've had