In Alan Gates' "Programming Pig", the chapter "Making Pig Fly" says:

"In testing we did while developing this feature we saw performance
improvements of up to 4x when using LZO, and slight performance degradation
when using gzip."
I've tried using LZO as the compression codec (it took me a couple of days
to compile it), and also gzip.
The gzip result is the same as mentioned in the book, but with LZO I don't
see improvements of up to 4x; instead I get almost no improvement, or a
slight degradation as well.
I enabled compression both between map and reduce and between M/R jobs,
with "pig.tmpfilecompression=true" and "pig.tmpfilecompression.codec=lzo".
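For reference, here is how I have the settings laid out. The two Pig properties are exactly what I set; the map-output properties below are my assumption of the Hadoop 1.x names for the between-map-and-reduce compression I enabled:

```
# Pig intermediate (between-M/R-job) temp file compression
pig.tmpfilecompression=true
pig.tmpfilecompression.codec=lzo

# Map output compression (between map and reduce), Hadoop 1.x property names
mapred.compress.map.output=true
mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec
```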
From the counters I can see the HDFS bytes are compressed to about 1/3 of
the uncompressed size.
I can see the following in the TaskTracker log:
2012-11-18 16:14:11,638 INFO
com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl
2012-11-18 16:14:11,639 INFO com.hadoop.compression.lzo.LzoCodec:
Successfully loaded & initialized native-lzo library
2012-11-18 16:14:11,640 INFO org.apache.hadoop.io.compress.CodecPool:
Got brand-new decompressor
The data volume is about 6 GB in total, and I have 100 CPUs and 150 GB of
memory spread across 10 nodes.
My Pig script is compiled into 4 M/R jobs. The operations of the jobs, in
order, are: MAP_ONLY --> HASH_JOIN --> GROUP_BY --> HASH_JOIN.
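To give a sense of the shape of the script (this is a simplified sketch with made-up relation and field names, not my actual script), it looks roughly like:

```
-- job 1: map-only load/projection
a = LOAD 'input' USING PigStorage('\t') AS (key, val);
-- job 2: hash join against a second input
b = JOIN a BY key, lookup BY key;
-- job 3: group-by with an aggregate
c = GROUP b BY a::key;
d = FOREACH c GENERATE group, COUNT(b) AS cnt;
-- job 4: final hash join
e = JOIN d BY group, other BY key;
STORE e INTO 'output';
```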
My guess is that I/O is not a bottleneck for me, but it was in Alan Gates'
case when he wrote the book.
Does anyone have a clue why I didn't see any improvement?