Pig, mail # user - Lzo compression


W W 2012-11-18, 16:25
Re: Lzo compression
Kannan Shah 2012-11-18, 19:55
I think you nailed it with "I guess I/O is not a bottleneck for me". Yes,
when you can dedicate a CPU to it, in-stream decompression is faster than
I/O, but if your downstream process is complicated, you probably won't see
much benefit, because the decompression step will be waiting on the
downstream process.

You'll see a little benefit if your Pig job (the downstream process) is faster
than I/O but slower than the decompression.
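
To make the argument concrete, here is a rough sketch of it as a bottleneck
model: a streaming read -> decompress -> process pipeline runs at the pace of
its slowest stage. The function names and every rate below are made-up
numbers for illustration only, not measurements from this cluster or from
the book.

```python
def pipeline_throughput(io_mb_s, ratio, decompress_mb_s, downstream_mb_s):
    """MB/s of *uncompressed* data flowing through read -> decompress -> process.

    The slowest stage sets the pace for the whole stream.
    """
    # Each compressed MB read from disk expands to `ratio` uncompressed MB.
    effective_io = io_mb_s * ratio
    return min(effective_io, decompress_mb_s, downstream_mb_s)


def speedup(io_mb_s, ratio, decompress_mb_s, downstream_mb_s):
    """Throughput with compression divided by throughput without it."""
    uncompressed = min(io_mb_s, downstream_mb_s)
    compressed = pipeline_throughput(io_mb_s, ratio, decompress_mb_s, downstream_mb_s)
    return compressed / uncompressed


if __name__ == "__main__":
    # Hypothetical rates: 50 MB/s disk, 3x compression, 400 MB/s decompression.
    for label, downstream in [("I/O-bound job", 1000),
                              ("job between I/O and decompression", 80),
                              ("CPU-bound job", 20)]:
        print(label, round(speedup(50, 3, 400, downstream), 2))
```

With these assumed rates, only the I/O-bound job gets the full benefit of the
compression ratio; the CPU-bound job sees no change, because the stage after
decompression was already the limit.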

Kannan

On 18 November 2012 08:25, W W <[EMAIL PROTECTED]> wrote:

> hello
>
> In Alan Gates' Programming Pig, in the chapter "Making Pig Fly", it is
> mentioned:
> In testing we did while developing this feature we saw performance
> improvements of up to 4x when using LZO, and slight performance degradation
> when using gzip.
> (http://ofps.oreilly.com/titles/9781449302641/making_pig_fly.html)
>
>
> I've tried using LZO as the compression codec (it took me a couple of days
> to compile it), and also gzip.
> The result with gzip is the same as mentioned in the book, but the result
> with LZO is not an improvement of up to 4x; there is almost no improvement,
> or even a slight degradation.
>
> I enabled compression between Map and Reduce, and also between M/R
> jobs: "pig.tmpfilecompression=true   pig.tmpfilecompression.codec=lzo".
>
> From the counters I can see the HDFS bytes are compressed to about 1/3
> of the uncompressed size.
> I can see the following in the TaskTracker log.
>
> 2012-11-18 16:14:11,638 INFO
> com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl
> library
> 2012-11-18 16:14:11,639 INFO com.hadoop.compression.lzo.LzoCodec:
> Successfully loaded & initialized native-lzo library
> 2012-11-18 16:14:11,640 INFO org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor
>
>
> The data volume is about 6 GB in total, and I have 100 CPUs and 150 GB of
> memory spread across 10 nodes.
> My Pig script is compiled into 4 M/R jobs. The operation in each job is:
>   MAP_ONLY   -->   HASH_JOIN  -->   GROUP_BY  -->   HASH_JOIN .
>
> My guess is that I/O is not a bottleneck for me, but it was one in
> Alan Gates' case when he wrote the book.
>
> Does anyone have any clue why I didn't gain any improvement?
>
>
> Thanks
> Regards
> Xingbang Wang
>

--
Kannan Shah

Analytical-Modeling Staff Scientist
Financial Services - Modeling
SAS Institute
San Diego

Detection-and-Estimation Group
Data Fusion Laboratory
Philadelphia