Re: Hadoop 1.0.4 Performance Problem
Hi Chris,

The standalone log analyzer was released in December and is designed to
be easier to use.

Regarding the license, I think it's fine to use it in a commercial
environment for evaluation purposes, and your feedback would help us
improve it.

Jie

On Tue, Dec 18, 2012 at 1:02 AM, Chris Smith <[EMAIL PROTECTED]> wrote:
> Jie,
>
> 'Recent' was over 11 months ago.  :-)
>
> Unfortunately the software licence requires that most of us 'negotiate'
> a commercial use licence before we trial the software in a commercial
> environment:
> http://www.cs.duke.edu/starfish/files/SOFTWARE_LICENSE_AGREEMENT.txt
> and as clarified here: http://www.cs.duke.edu/starfish/previous.html
>
> Under that last URL was a note that you were soon to distribute the
> source code under the Apache Software License.  Last time I asked, the
> reply was that this would not happen.  Perhaps it is time to update
> your web pages or your licence arrangements.  :-)
>
> I liked what I saw on my home 'cluster' but have not had the time to
> sort out licensing to trial this in a commercial environment.
>
> Chris
>
>
>
>
>
> On 14 December 2012 01:46, Jie Li <[EMAIL PROTECTED]> wrote:
>>
>> Hi Jon,
>>
>> Thanks for sharing these insights!  Couldn't agree with you more!
>>
>> Recently we released a tool called the Starfish Hadoop Log Analyzer
>> for analyzing job histories.  I believe it can quickly point out the
>> reduce-side problem you hit!
>>
>> http://www.cs.duke.edu/starfish/release.html
>>
>> Jie
>>
>> On Wed, Nov 28, 2012 at 5:32 PM, Jon Allen <[EMAIL PROTECTED]> wrote:
>> > Jie,
>> >
>> > Simple answer - I got lucky (though obviously there are things you
>> > need to have in place to allow you to be lucky).
>> >
>> > Before running the upgrade I ran a set of tests to baseline the
>> > cluster performance, e.g. terasort, gridmix and some operational
>> > jobs.  Terasort by itself isn't very realistic as a cluster test but
>> > it's nice and simple to run and is good for regression testing things
>> > after a change.
>> >
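As an illustration of the kind of baseline run described above, here is a
rough, untested sketch that times teragen and terasort from the stock
examples jar.  The jar path, row count and HDFS paths are placeholders and
would need adjusting for the cluster under test.

# Rough sketch of a terasort baseline run; paths and sizes are placeholders.
import subprocess
import time

EXAMPLES_JAR = "/usr/lib/hadoop/hadoop-examples-1.0.4.jar"  # placeholder path
ROWS = 10000000000  # 100-byte rows, ~1 TB of input; size to suit the cluster

def timed_hadoop_job(args):
    """Run 'hadoop jar <examples jar> <args>' and return elapsed seconds."""
    start = time.time()
    subprocess.check_call(["hadoop", "jar", EXAMPLES_JAR] + args)
    return time.time() - start

gen_mins = timed_hadoop_job(["teragen", str(ROWS),
                             "/benchmarks/terasort-in"]) / 60
sort_mins = timed_hadoop_job(["terasort", "/benchmarks/terasort-in",
                              "/benchmarks/terasort-out"]) / 60
print("teragen: %.1f min, terasort: %.1f min" % (gen_mins, sort_mins))
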
>> > After the upgrade the intention was to run the same tests and show
>> > that the performance hadn't degraded (improved would have been nice
>> > but not worse was the minimum).  When we ran the terasort we found
>> > that performance was about 50% worse - execution time had gone from
>> > 40 minutes to 60 minutes.  As I've said, terasort doesn't provide a
>> > realistic view of operational performance, but this showed that
>> > something major had changed and we needed to understand it before
>> > going further.  So how to go about diagnosing this ...
>> >
>> > First rule - understand what you're trying to achieve.  It's very
>> > easy to say performance isn't good enough, but performance can always
>> > be better, so you need to know what's realistic and at what point
>> > you're going to stop tuning things.  I had a previous baseline that I
>> > was trying to match, so I knew what I was trying to achieve.
>> >
>> > Next thing to do is profile your job and identify where the problem
>> > is.  We had the full job history from the before and after jobs and,
>> > comparing these, we saw that map performance was fairly consistent,
>> > as were the reduce sort and reduce phases.  The problem was with the
>> > shuffle, which had gone from 20 minutes pre-upgrade to 40 minutes
>> > afterwards.  The important thing here is to make sure you've got as
>> > much information as possible.  If we'd just kept the overall job time
>> > then there would have been a lot more areas to look at, but knowing
>> > the problem was with shuffle allowed me to focus effort in this area.
>> >
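As an illustration of the kind of comparison described above, here is a
rough, untested sketch that pulls per-reduce-attempt shuffle/sort/reduce
timings out of a Hadoop 1.x job history file.  The event name
("ReduceAttempt") and the field names (START_TIME, SHUFFLE_FINISHED,
SORT_FINISHED, FINISH_TIME) are assumptions about the 1.x history format
and may need adjusting for a given version.

# Rough sketch: average shuffle/sort/reduce durations per reduce attempt
# from a Hadoop 1.x job history file (timestamps are in milliseconds).
import re
import sys
from collections import defaultdict

KV = re.compile(r'(\w+)="([^"]*)"')
attempts = defaultdict(dict)

with open(sys.argv[1]) as history:  # path to the job history file
    for line in history:
        if not line.startswith("ReduceAttempt"):
            continue
        fields = dict(KV.findall(line))
        attempts[fields.get("TASK_ATTEMPT_ID", "?")].update(fields)

totals = {"shuffle": 0, "sort": 0, "reduce": 0}
done = 0
needed = ("START_TIME", "SHUFFLE_FINISHED", "SORT_FINISHED", "FINISH_TIME")
for f in attempts.values():
    if not all(k in f for k in needed):
        continue
    totals["shuffle"] += int(f["SHUFFLE_FINISHED"]) - int(f["START_TIME"])
    totals["sort"] += int(f["SORT_FINISHED"]) - int(f["SHUFFLE_FINISHED"])
    totals["reduce"] += int(f["FINISH_TIME"]) - int(f["SORT_FINISHED"])
    done += 1

if done:
    for phase in ("shuffle", "sort", "reduce"):
        print("avg %s: %.1f s over %d attempts"
              % (phase, totals[phase] / 1000.0 / done, done))
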
>> > So what had changed in the shuffle that may have slowed things down?
>> > The first thing we thought of was that we'd moved from a tarball
>> > deployment to using the RPM, so what effect might this have had on
>> > things?  Our operational configuration compresses the map output and
>> > in the past we've had