Pig, mail # user - Benchmark Hadoop and Pig UDFs


Re: Benchmark Hadoop and Pig UDFs
Guy Bayes 2011-04-20, 22:04
One thing I would say is don't benchmark on EC2, do it on physical
hardware...

There is a test harness infrastructure for generic benchmarking at
http://bbltest.sourceforge.net/

that might be somewhat useful
Guy

On Wed, Apr 20, 2011 at 2:19 PM, Lai Will <[EMAIL PROTECTED]> wrote:

> My goal is to show that Hadoop can be used for a certain use case.
> I don't need to compare the different usage forms of Hadoop.
>
> So your second hint is pretty much what I thought of doing.
>
> Do you or does anyone else already have experience in doing that?
> What technologies did you use to achieve that? A bash script? Python?
> How would you set up the benchmark?
>
> Best,
> Will
>
> -----Original Message-----
> From: Mridul Muralidharan [mailto:[EMAIL PROTECTED]]
> Sent: Wednesday, 20 April 2011 23:13
> To: [EMAIL PROTECTED]
> Cc: Lai Will
> Subject: Re: Benchmark Hadoop and Pig UDFs
>
>
> Not sure what the scope of the experiment is, but some useful comparisons
> could be against:
> a) a job using only the mapred API.
> b) Hadoop Streaming.
> c) Pig streaming.
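
A rough sketch of how such a comparison might be driven from one Python script; the jar names, class name, paths, and Pig script below are placeholders rather than anything taken from this thread:

    #!/usr/bin/env python
    # Rough sketch: launch the three variants listed above and record the
    # wall-clock time of each. Jar names, class names, paths and script
    # names are placeholders; adjust them for your cluster and Hadoop version.
    import subprocess
    import time

    VARIANTS = {
        # (a) a plain MapReduce job written against the mapred API
        "mapred_api": ["hadoop", "jar", "my-job.jar", "com.example.MyJob",
                       "/input", "/out-mapred"],
        # (b) Hadoop Streaming with external mapper/reducer scripts
        "hadoop_streaming": ["hadoop", "jar", "hadoop-streaming.jar",
                             "-input", "/input", "-output", "/out-streaming",
                             "-mapper", "mapper.py", "-reducer", "reducer.py"],
        # (c) a Pig script that does the same work via STREAM or a UDF
        "pig_streaming": ["pig", "-f", "job_streaming.pig"],
    }

    for name, cmd in VARIANTS.items():
        start = time.time()
        subprocess.check_call(cmd)   # fails loudly if the job fails
        print("%s finished in %.1f s" % (name, time.time() - start))
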
>
> It also depends on the actual script/job being run: whether it uses
> combiners or multiple outputs, the depth of the pipeline, how many jobs you
> end up running for it, etc.
>
>
>
> If you are only interested in testing how Pig scales, then interesting
> metrics could be:
> a) size of input.
> b) with/without compression.
> c) number of mappers.
> d) number of reducers.
> e) output size (depending on what you are running I guess).
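
One way to sweep those dimensions is a small Python driver that passes each combination to a Pig script through parameter substitution. The benchmark.pig name, input paths and reducer counts below are assumptions for illustration; the script itself would have to reference $INPUT and $REDUCERS:

    #!/usr/bin/env python
    # Sketch of a sweep over input size and reducer count, timing each run.
    # Assumes a hypothetical benchmark.pig that uses $INPUT and $REDUCERS via
    # Pig parameter substitution; compression and mapper count could be
    # varied the same way.
    import csv
    import itertools
    import subprocess
    import time

    INPUTS = ["/data/1gb", "/data/10gb", "/data/100gb"]   # growing input size
    REDUCERS = [1, 4, 16]                                 # number of reducers

    with open("sweep_results.csv", "w") as out:
        writer = csv.writer(out)
        writer.writerow(["input", "reducers", "seconds"])
        for path, reducers in itertools.product(INPUTS, REDUCERS):
            cmd = ["pig",
                   "-p", "INPUT=%s" % path,
                   "-p", "REDUCERS=%d" % reducers,
                   "-f", "benchmark.pig"]
            start = time.time()
            subprocess.check_call(cmd)
            writer.writerow([path, reducers, round(time.time() - start, 1)])

Writing the raw numbers to a CSV keeps them around so scaling can be plotted afterwards.
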
>
>
> Regards,
> Mridul
>
>
> On Thursday 21 April 2011 01:27 AM, Lai Will wrote:
> > Hi there,
> >
> > I'm planning to do some performance measurements of my Hadoop Pig code
> > to see how it scales.
> > Does anyone have some suggestions on how to do that?
> >
> > I thought of measuring the time to completion on a fixed cluster size
> > while increasing the input data, and then fixing the input data while
> > adding cluster nodes. Does anyone have experience doing that? I thought
> > of writing a script that starts/stops the timer and executes the Pig
> > command. Maybe there's a better way?
> >
> > Best,
> > Will
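
The kind of script described in the original question (start a timer, run the Pig command, stop the timer) can be a few lines of Python. A minimal sketch, with the script name and run count as placeholders:

    #!/usr/bin/env python
    # Minimal timing wrapper of the kind discussed above: start a clock,
    # run the Pig script, stop the clock. Repeating the run makes warm-up
    # effects and run-to-run variance visible. Script name and run count
    # are placeholders.
    import subprocess
    import time

    PIG_SCRIPT = "benchmark.pig"   # hypothetical script under test
    RUNS = 3                       # repeat to expose run-to-run noise

    timings = []
    for i in range(RUNS):
        start = time.time()
        subprocess.check_call(["pig", "-f", PIG_SCRIPT])
        timings.append(time.time() - start)
        print("run %d: %.1f s" % (i + 1, timings[-1]))

    print("median: %.1f s" % sorted(timings)[len(timings) // 2])

The same loop can simply be rerun after growing the input or adding cluster nodes to cover both halves of the plan above.
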