Pig >> mail # dev >> Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?

Re: Is there a good benchmark to evaluate the CPU time/space tradeoff in the shuffle stage of hadoop?
Just curious - the first thing I did when I started using Pig was to
test lzo/gzip/bzip2, and lzo, gzip, and even low-compression bzip2 all
had plenty of processor to spare. I tested both the native libs and the
pure-Java implementations, and I could not get CPU bound until I
cranked up the compression level on bzip2.

Why is gzip considered too CPU intensive? I tested on my own machine and
on EC2, I think with the Cloudera EC2 scripts, and it seemed the clear
winner. I guess this varies a lot based on cluster configuration,
workload, use of combiners, etc.?

Russell Jurney http://datasyndrome.com

On May 22, 2012, at 8:58 PM, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> But you don't capture the nature of the speed benefit of less data going
> over the wire, right? I mean, a lot of people use gzip, but in a Hadoop
> context it is considered too CPU intensive, and the gain in speed from
> less data going over the wire isn't enough to counteract that... I'm not
> quite sure how to establish that with other methods. I can quantify the
> CPU/size tradeoff with a microbenchmark, but not how it plays out on the
> network.
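[The CPU/size half of that tradeoff can be sketched in a few lines of plain Java. This is an illustration using `java.util.zip.Deflater` - the same DEFLATE algorithm behind gzip - at different levels, not a benchmark of Hadoop's actual codec classes; the synthetic data generator and sizes are made up for demonstration:]

```java
import java.util.Random;
import java.util.zip.Deflater;

public class CompressionBench {
    public static void main(String[] args) {
        // Synthetic, somewhat-compressible data standing in for
        // serialized intermediate tuples.
        byte[] input = sampleData(4 * 1024 * 1024);
        for (int level : new int[] {1, 6, 9}) { // fast .. default .. max
            Deflater d = new Deflater(level);
            d.setInput(input);
            d.finish();
            byte[] scratch = new byte[64 * 1024];
            long t0 = System.nanoTime();
            int compressed = 0;
            while (!d.finished()) {
                compressed += d.deflate(scratch, 0, scratch.length);
            }
            long ms = (System.nanoTime() - t0) / 1000000L;
            d.end();
            System.out.printf("level=%d time=%dms ratio=%.3f%n",
                level, ms, (double) compressed / input.length);
        }
    }

    static byte[] sampleData(int n) {
        Random r = new Random(42);
        String[] words = {"alpha", "beta", "gamma", "delta", "1234", "foo"};
        StringBuilder sb = new StringBuilder(n + 16);
        while (sb.length() < n) {
            sb.append(words[r.nextInt(words.length)]).append('\t');
        }
        return sb.toString().getBytes();
    }
}
```

[As the thread notes, this only measures CPU time and output size in isolation; it says nothing about how the smaller payload plays out on the network.]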
> 2012/5/22 Bill Graham <[EMAIL PROTECTED]>
>
>> You could also try using a microbenchmark framework to test out various
>> compression techniques in isolation.
>>
>> On Tuesday, May 22, 2012, Jonathan Coveney wrote:
>>
>>> Will do, thanks
>>> 2012/5/22 Alan Gates <[EMAIL PROTECTED]>
>>>
>>>> You might post this same question to mapred-user@hadoop.  I know Owen
>>>> and Arun have done a lot of analysis of these kinds of things when
>>>> optimizing the terasort.  Others may have valuable feedback there as
>>>> well.
>>>>
>>>> Alan.
>>>>
>>>> On May 22, 2012, at 12:23 PM, Jonathan Coveney wrote:
>>>>> I've been dealing some with the intermediate serialization in Pig,
>>>>> and will probably be dealing with it more in the future. When
>>>>> serializing, there is generally a time-to-serialize vs.
>>>>> space-on-disk tradeoff (an extreme example being compression vs. no
>>>>> compression, a more nuanced one being varint vs. full int, that
>>>>> sort of thing). With Hadoop, network IO is generally the
>>>>> bottleneck, but I'm not sure of the best way to evaluate something
>>>>> like: method X takes 3x as long to serialize, but is potentially
>>>>> 1/2 as large on disk.
>>>>>
>>>>> What are people doing in the wild?
>>>>>
>>>>> Jon