Pig >> mail # dev >> Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?


Re: Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?
I am afraid that the space vs. time conversion factor is hardware
dependent, and as such cannot be optimized a priori.
Imagine having a 100Mbit ethernet vs. an Infiniband connection, or running
hadoop on embedded processors vs. top-end servers.
There is no one-size-fits-all, unfortunately.
The only choice I see is to have sane defaults for the common case and allow
customization for the extreme cases.
That is, try it on an average cluster of commodity machines, varying the
size from small (tens) to very large (thousands) and see what happens.

Cheers,
--
Gianmarco
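[The microbenchmark idea raised downthread can be sketched in a few lines. This is a rough illustration only, using java.util.zip.Deflater as a stand-in for Hadoop's gzip codec; the input data, buffer sizes, and compression levels are made up for the example and are not from the thread.]

```java
import java.util.Random;
import java.util.zip.Deflater;

public class CompressionTradeoff {
    public static void main(String[] args) {
        // Synthetic low-entropy input standing in for shuffle data.
        byte[] input = new byte[4 << 20];
        Random rnd = new Random(42);
        for (int i = 0; i < input.length; i++) {
            input[i] = (byte) ('a' + rnd.nextInt(8));
        }
        byte[] out = new byte[input.length + 1024];
        // Compare fast, default, and best-compression settings.
        for (int level : new int[] {1, 6, 9}) {
            Deflater d = new Deflater(level);
            d.setInput(input);
            d.finish();
            long t0 = System.nanoTime();
            int compressed = 0;
            while (!d.finished()) {
                compressed += d.deflate(out); // count output bytes; discard them
            }
            long ms = (System.nanoTime() - t0) / 1_000_000;
            d.end();
            System.out.printf("level=%d size=%d (%.1f%% of input) time=%dms%n",
                    level, compressed, 100.0 * compressed / input.length, ms);
        }
    }
}
```

This only measures the CPU/size side of the tradeoff on one machine; as noted above, how the smaller payload pays off on the wire still depends on the cluster's network.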
On Wed, May 23, 2012 at 7:07 AM, Russell Jurney <[EMAIL PROTECTED]> wrote:

> Just curious - the first thing I did when I started using pig was to
> test lzo/gzip/bzip, and lzo, gzip and even low compression bzip all
> had tons of processor to spare. I tested native
> libs and java stuff, and I could not get CPU bound until I cranked the
> compression on bzip2.
>
> Why is gzip considered too CPU intensive? I tested on my machine and
> on ec2, I think with the Cloudera ec2 scripts. It seemed the clear
> winner. I guess this varies a lot based on cluster configuration,
> workload, use of combine, etc?
>
> Russell Jurney http://datasyndrome.com
>
> On May 22, 2012, at 8:58 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:
>
> > But you don't capture the nature of the speed benefit of less data going
> > over the wire, right? I mean a lot of people use GZip, but in a hadoop
> > context, it is considered too CPU intensive, and the gain in speed from
> > less data going over the wire isn't enough to counteract that... I'm not
> > quite sure how to establish that with other methods. I can quantify the
> > cpu/size tradeoff with a microbenchmark, but not how it plays out on the
> > network.
> >
> > 2012/5/22 Bill Graham <[EMAIL PROTECTED]>
> >
> >> You could also try using a microbenchmark framework to test out various
> >> compression techniques in isolation.
> >>
> >> On Tuesday, May 22, 2012, Jonathan Coveney wrote:
> >>
> >>> Will do, thanks
> >>>
> >>> 2012/5/22 Alan Gates <[EMAIL PROTECTED]>
> >>>
> >>>> You might post this same question to mapred-user@hadoop.  I know Owen
> >>>> and Arun have done a lot of analysis of these kinds of things when
> >>>> optimizing the terasort.  Others may have valuable feedback there as
> >>>> well.
> >>>>
> >>>> Alan.
> >>>>
> >>>> On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:
> >>>>
> >>>>> I've been dealing some with the intermediate serialization in Pig,
> >>>>> and will probably be dealing with it more in the future. When
> >>>>> serializing, there is generally the time to serialize vs. space on
> >>>>> disk tradeoff (an extreme example being compression vs. no
> >>>>> compression, a more nuanced one being varint vs. full int, that sort
> >>>>> of thing). With Hadoop, generally network io is the bottleneck, but
> >>>>> I'm not sure of the best way to evaluate something like: method X
> >>>>> takes 3x as long to serialize, but is potentially 1/2 as large on
> >>>>> disk.
> >>>>>
> >>>>> What are people doing in the wild?
> >>>>> Jon
> >>>>
> >>>>
> >>>
> >>
> >>
> >> --
> >> Sent from Gmail Mobile
> >>
>
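[For anyone who wants to repeat Russell's codec comparison, map-output compression can be toggled per-script from Pig. The property names below are the Hadoop 1.x-era ones; this is a config sketch, so check them against your Hadoop/Pig versions.]

```
-- compress intermediate (map output) data
SET mapred.compress.map.output true;
SET mapred.map.output.compression.codec org.apache.hadoop.io.compress.GzipCodec;
```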