Pig, mail # dev - Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?


Jonathan Coveney 2012-05-22, 19:23
Alan Gates 2012-05-22, 21:26
Jonathan Coveney 2012-05-23, 00:09
Re: Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?
Bill Graham 2012-05-23, 03:35
You could also try using a microbenchmark framework to test out various
compression techniques in isolation.
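
A minimal sketch of what such an isolated test might look like, written in
Python for brevity rather than against Pig's actual serialization code. The
three encoders (fixed 4-byte ints, base-128 varints, and zlib over the fixed
encoding) and the helper names are illustrative assumptions, not anything
from Pig or Hadoop. It only reports encode time and output size for each
option, which is the raw input you would need before weighing CPU against
shuffle bytes.

```python
import struct
import time
import zlib


def encode_fixed(values):
    """Encode each value as a 4-byte big-endian unsigned int."""
    return struct.pack(">%dI" % len(values), *values)


def encode_varint(values):
    """Encode each non-negative value as a base-128 varint (7 data bits per byte)."""
    out = bytearray()
    for v in values:
        while v >= 0x80:
            out.append((v & 0x7F) | 0x80)  # low 7 bits, continuation bit set
            v >>= 7
        out.append(v)  # final byte, continuation bit clear
    return bytes(out)


def encode_fixed_zlib(values):
    """Fixed-width encoding followed by zlib at its default level."""
    return zlib.compress(encode_fixed(values))


def bench(encoder, values, repeats=5):
    """Return (best wall-clock seconds over `repeats` runs, encoded size in bytes)."""
    best = float("inf")
    data = b""
    for _ in range(repeats):
        start = time.perf_counter()
        data = encoder(values)
        best = min(best, time.perf_counter() - start)
    return best, len(data)


if __name__ == "__main__":
    # Hypothetical workload: small non-negative ints, which favors varints.
    values = [i % 1000 for i in range(200_000)]
    for encoder in (encode_fixed, encode_varint, encode_fixed_zlib):
        seconds, size = bench(encoder, values)
        print("%-18s %8.4f s %10d bytes" % (encoder.__name__, seconds, size))
```

The numbers depend heavily on the value distribution, so the test data
should resemble real intermediate records; a benchmark on the JVM would also
need warm-up runs that this sketch does not attempt.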

On Tuesday, May 22, 2012, Jonathan Coveney wrote:

> Will do, thanks
>
> 2012/5/22 Alan Gates <[EMAIL PROTECTED]>
>
> > You might post this same question to mapred-user@hadoop.  I know Owen and
> > Arun have done a lot of analysis of these kinds of things when optimizing
> > the terasort.  Others may have valuable feedback there as well.
> >
> > Alan.
> >
> > On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:
> >
> > > I've been dealing some with the intermediate serialization in Pig, and will
> > > probably be dealing with it more in the future. When serializing, there is
> > > generally the time to serialize vs. space on disk tradeoff (an extreme
> > > example being compression vs. no compression, a more nuanced one being
> > > varint vs full int, that sort of thing). With Hadoop, generally network io
> > > is the bottleneck, but I'm not sure of the best way to evaluate something
> > > like: method X takes 3x as long to serialize, but is potentially 1/2 as
> > > large on disk.
> > >
> > > What are people doing in the wild?
> > > Jon
> >
> >
>
--
Sent from Gmail Mobile
Jonathan Coveney 2012-05-23, 03:57
Russell Jurney 2012-05-23, 05:07
Gianmarco De Francisci Mo... 2012-05-23, 22:44
Bill Graham 2012-05-23, 05:03