Pig >> mail # user >> Lzo compression


Kannan Shah, 2012-11-18, 16:25
Re: Lzo compression
I think you nailed it with "I guess I/O is not a bottleneck for me". Yes,
when you can give it a dedicated CPU, decompressing in-stream is faster than
the I/O, but if your downstream process is complicated, you probably won't
see much benefit, because the decompression will end up waiting on the
downstream process. The pipeline runs at the speed of its slowest stage: if,
say, the disk delivers 100 MB/s and LZO decompresses at several times that,
but your Pig operators consume only 50 MB/s, you get 50 MB/s either way.

You'll see a little benefit if your Pig job (the downstream process) is
faster than I/O but still slower than the decompression.

Kannan

On 18 November 2012 08:25, W W <[EMAIL PROTECTED]> wrote:

> hello
>
> In Alan Gates' "Programming Pig", the chapter "Making Pig Fly" mentions:
>
> "In testing we did while developing this feature we saw performance
> improvements of up to 4x when using LZO, and slight performance degradation
> when using gzip."
> (http://ofps.oreilly.com/titles/9781449302641/making_pig_fly.html)
>
>
> I've tried using LZO as the compression codec (it took me a couple of days
> to compile it), and also gzip.
> The result with gzip is the same as mentioned in the book, but with LZO I
> saw not an improvement of up to 4x, but almost no improvement, or even a
> slight degradation as well.
>
> I enabled compression between Map and Reduce, and also between M/R jobs
> ("pig.tmpfilecompression=true, pig.tmpfilecompression.codec=lzo"), as
> sketched below.
>
> From the counters I can see the HDFS bytes are compressed to about 1/3
> compared with no compression.
> I can see the following in the TaskTracker log:
>
> 2012-11-18 16:14:11,638 INFO
> com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl
> library
> 2012-11-18 16:14:11,639 INFO com.hadoop.compression.lzo.LzoCodec:
> Successfully loaded & initialized native-lzo library
> 2012-11-18 16:14:11,640 INFO org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor
>
>
> The data volume is about 6G in total, and I have 100 CPUs + 150G of memory
> spread across 10 nodes.
> My Pig script compiles into 4 M/R jobs. The operations in the jobs are:
>   MAP_ONLY --> HASH_JOIN --> GROUP_BY --> HASH_JOIN
> (a script of roughly this shape is sketched below).
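>
> For context, a made-up script of roughly that shape (relation and field
> names are hypothetical, not my actual script):
>
>   a = LOAD 'input_a' AS (id:chararray, val:int);
>   b = LOAD 'input_b' AS (id:chararray, attr:chararray);
>   -- first HASH_JOIN
>   j1 = JOIN a BY id, b BY id;
>   -- GROUP_BY with an aggregation
>   g = GROUP j1 BY a::id;
>   agg = FOREACH g GENERATE group AS id, SUM(j1.a::val) AS total;
>   -- second HASH_JOIN
>   j2 = JOIN agg BY id, b BY id;
>   STORE j2 INTO 'output';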
>
> My guess at the reason is that I/O is not a bottleneck for me, but it was
> one in Alan Gates' case when he wrote the book.
>
> Does anyone have any clue why I didn't gain any improvement?
>
>
> Thanks
> Regards
> Xingbang Wang
>

--
Kannan Shah

Analytical-Modeling Staff Scientist
Financial Services - Modeling
SAS Institute
San Diego

Detection-and-Estimation Group
Data Fusion Laboratory
Philadelphia