
Pig >> mail # dev >> Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?


Re: Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?
True, to capture the network effect you'll need to run MR on a cluster.

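The microbenchmark half of the tradeoff can be sketched along these lines (a Python sketch using zlib as a stand-in codec; the payload, compression levels, and bandwidth figures are illustrative assumptions, and as noted above this does not capture real shuffle/network behavior):

```python
import time
import zlib

# Illustrative payload: repetitive delimited text, which compresses
# well, like much intermediate MR data.
data = b"userid=12345\tclicks=7\tcountry=US\n" * 50_000

for level in (1, 6, 9):  # fast, default, max compression
    start = time.perf_counter()
    compressed = zlib.compress(data, level)
    cpu_seconds = time.perf_counter() - start

    bytes_saved = len(data) - len(compressed)
    # Compression pays off on the wire roughly when the transfer time
    # saved exceeds the extra CPU time spent:
    #     bytes_saved / bandwidth > cpu_seconds
    # i.e. the codec wins below this effective bandwidth (bytes/sec):
    breakeven_bw = bytes_saved / cpu_seconds

    print(f"level={level} ratio={len(compressed) / len(data):.3f} "
          f"cpu={cpu_seconds * 1000:.1f}ms "
          f"wins_below={breakeven_bw / 1e6:.0f} MB/s")
```

On a gigabit link (~125 MB/s of payload) a codec whose break-even bandwidth sits above that figure is a net win for shuffle, which is the usual argument for lighter codecs over gzip there.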
On Tue, May 22, 2012 at 8:57 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> But you don't capture the speed benefit of less data going over the wire,
> right? I mean, a lot of people use GZip, but in a Hadoop context it is
> considered too CPU intensive, and the gain from sending less data over the
> wire isn't enough to counteract that... I'm not quite sure how to establish
> that with other methods. I can quantify the CPU/size tradeoff with a
> microbenchmark, but not how it plays out on the network.
>
> 2012/5/22 Bill Graham <[EMAIL PROTECTED]>
>
> > You could also try using a microbenchmark framework to test out various
> > compression techniques in isolation.
> >
> > On Tuesday, May 22, 2012, Jonathan Coveney wrote:
> >
> > > Will do, thanks
> > >
> > > 2012/5/22 Alan Gates <[EMAIL PROTECTED]>
> > >
> > > > You might post this same question to mapred-user@hadoop.  I know Owen
> > > > and Arun have done a lot of analysis of these kinds of things when
> > > > optimizing the terasort.  Others may have valuable feedback there as
> > > > well.
> > > >
> > > > Alan.
> > > >
> > > > On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:
> > > >
> > > > > I've been dealing some with the intermediate serialization in Pig,
> > > > > and will probably be dealing with it more in the future. When
> > > > > serializing, there is generally a tradeoff between time to serialize
> > > > > and space on disk (an extreme example being compression vs. no
> > > > > compression; a more nuanced one being varint vs. full int, that sort
> > > > > of thing). With Hadoop, network IO is generally the bottleneck, but
> > > > > I'm not sure of the best way to evaluate something like: method X
> > > > > takes 3x as long to serialize, but is potentially 1/2 as large on
> > > > > disk.
> > > > >
> > > > > What are people doing in the wild?
> > > > > Jon
> > > >
> > > >
> > >
> >
> >
> > --
> > Sent from Gmail Mobile
> >
>
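The varint-versus-full-int example from the original question can be made concrete with a minimal base-128 (LEB128-style) unsigned varint. This is an illustrative sketch, similar in spirit to what Hadoop and Pig do for variable-length ints, not their actual wire format:

```python
def encode_varint(n: int) -> bytes:
    """Encode a non-negative int as a base-128 varint, 7 bits per byte."""
    if n < 0:
        raise ValueError("unsigned only")
    out = bytearray()
    while True:
        byte = n & 0x7F
        n >>= 7
        if n:
            out.append(byte | 0x80)  # set continuation bit: more bytes follow
        else:
            out.append(byte)
            return bytes(out)

def decode_varint(buf: bytes) -> int:
    """Decode a base-128 varint back to an int."""
    result = shift = 0
    for byte in buf:
        result |= (byte & 0x7F) << shift
        if not byte & 0x80:  # continuation bit clear: last byte
            return result
        shift += 7
    raise ValueError("truncated varint")

# Small values take 1 byte instead of a fixed 4; large values can take
# up to 5, so the space win depends on the value distribution.
print(len(encode_varint(7)))          # 1
print(len(encode_varint(300)))        # 2
print(len(encode_varint(2**31 - 1)))  # 5
```

Whether that space win is worth the extra branching per value is exactly the CPU/size question of this thread, and it depends on how skewed toward small values the data is.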

--
*Note that I'm no longer using my Yahoo! email address. Please email me at
[EMAIL PROTECTED] going forward.*