I think you nailed it with "I guess I/O is not a bottleneck for me". Yes,
when you can dedicate a CPU to it, in-stream decompression is faster than
I/O, but if your downstream process is complicated, you probably won't see
much benefit, because the decompression process will be waiting for the
downstream process.
You'll see a little benefit if your Pig job (the downstream process) is
faster than I/O but possibly slower than the decompression.
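To put rough numbers on that reasoning, here is a minimal sketch that treats
the job as a streaming pipeline whose throughput is set by its slowest stage.
The function names and every rate below are made up for illustration, not
measurements from your cluster, and the model assumes the decompressor gets
its own CPU so the stages overlap.

```python
def throughput(io_mb_s, process_mb_s, decompress_mb_s=None, ratio=1.0):
    """Uncompressed MB/s delivered by a pipeline limited by its slowest stage.

    ratio is uncompressed size / compressed size (1.0 means no compression).
    decompress_mb_s is the decompressor's output rate, or None if unused.
    All parameters are hypothetical inputs to a toy model.
    """
    # Each MB read from disk expands to `ratio` MB of usable data.
    effective_io = io_mb_s * ratio
    stages = [effective_io, process_mb_s]
    if decompress_mb_s is not None:
        stages.append(decompress_mb_s)
    return min(stages)


def speedup(io_mb_s, process_mb_s, decompress_mb_s, ratio):
    """Ratio of compressed-pipeline throughput to uncompressed throughput."""
    plain = throughput(io_mb_s, process_mb_s)
    compressed = throughput(io_mb_s, process_mb_s, decompress_mb_s, ratio)
    return compressed / plain


if __name__ == "__main__":
    # I/O-bound job: compression helps by roughly the compression ratio.
    print(speedup(io_mb_s=50, process_mb_s=400, decompress_mb_s=300, ratio=3.0))
    # Job faster than I/O but slower than decompression: partial benefit.
    print(speedup(io_mb_s=50, process_mb_s=100, decompress_mb_s=300, ratio=3.0))
    # CPU-bound job (complicated downstream process): no benefit.
    print(speedup(io_mb_s=50, process_mb_s=30, decompress_mb_s=300, ratio=3.0))
```

In this toy model the gain disappears as soon as the downstream processing
rate drops below the raw I/O rate, which is consistent with what you observed.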
On 18 November 2012 08:25, W W <[EMAIL PROTECTED]> wrote:
> In Alan Gates' Programming Pig, chapter "Making Pig Fly", it was stated:
> In testing we did while developing this feature we saw performance
> improvements of up to 4x when using LZO, and slight performance degradation
> when using gzip.
> I tried LZO as the compression codec (it took me a couple of days to
> compile it), and also gzip.
> The gzip result matches what the book describes, but with LZO I don't see
> anything like a 4x improvement: there is almost no improvement, or even a
> slight degradation.
> I enabled compression between Map and Reduce, and also between M/R
> jobs with "pig.tmpfilecompression=true pig.tmpfilecompression.codec=lzo".
> From the counters I can see that the HDFS bytes are compressed to about
> 1/3 of their uncompressed size.
> I can see the following in the TaskTracker log:
> 2012-11-18 16:14:11,638 INFO
> com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl
> 2012-11-18 16:14:11,639 INFO com.hadoop.compression.lzo.LzoCodec:
> Successfully loaded & initialized native-lzo library
> 2012-11-18 16:14:11,640 INFO org.apache.hadoop.io.compress.CodecPool:
> Got brand-new decompressor
> The data volume is about 6 GB in total, and I have 100 CPUs and 150 GB of
> memory spread across 10 nodes.
> My Pig script is compiled into 4 M/R jobs. The operations, one per job, are:
> MAP_ONLY --> HASH_JOIN --> GROUP_BY --> HASH_JOIN .
> My guess at the reason is that I/O is not a bottleneck for me, but was one
> in Alan Gates' case when he wrote the book.
> Does anyone have any idea why I didn't see any improvement?
> Xingbang Wang
Analytical-Modeling Staff Scientist
Financial Services - Modeling
Data Fusion Laboratory