Lzo compression
W W 2012-11-18, 16:25
Hello,

In Alan Gates' Programming Pig, chapter "Making Pig Fly", it is mentioned:

"In testing we did while developing this feature we saw performance improvements of up to 4x when using LZO, and slight performance degradation when using gzip."
(http://ofps.oreilly.com/titles/9781449302641/making_pig_fly.html)

I tried LZO as the compression codec (it took me a couple of days to compile it), and also gzip. The gzip result matches what the book describes, but with LZO I do not see an improvement of up to 4x; there is almost no improvement, or even a slight degradation.

I enabled compression between map and reduce, and also between M/R jobs:

pig.tmpfilecompression=true
pig.tmpfilecompression.codec=lzo

From the counters I can see that the HDFS bytes are compressed to about 1/3 of the uncompressed size. I can see the following in the TaskTracker log:

2012-11-18 16:14:11,638 INFO com.hadoop.compression.lzo.GPLNativeCodeLoader: Loaded native gpl library
2012-11-18 16:14:11,639 INFO com.hadoop.compression.lzo.LzoCodec: Successfully loaded & initialized native-lzo library
2012-11-18 16:14:11,640 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor

The data volume is about 6 GB in total, and I have 100 CPUs and 150 GB of memory spread across 10 nodes. My Pig script is compiled into 4 M/R jobs. The operations in the jobs are: MAP_ONLY --> HASH_JOIN --> GROUP_BY --> HASH_JOIN.

My guess is that I/O is not a bottleneck for me, but was one in Alan Gates' case when he wrote the book. Does anyone have a clue why I didn't gain any improvement?

Thanks
Regards
Xingbang Wang
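For reference, a rough sketch of how the two kinds of compression described in this post (map output, and temporary files between M/R jobs) might be set at the top of a Pig script. The two pig.tmpfilecompression properties are the ones quoted in the post. The map-output property names are an assumption based on Hadoop 1.x naming, and the codec class name is taken from the TaskTracker log; the poster's actual map-output settings are not shown in the message.

```
-- Compress temporary files written between the M/R jobs of one Pig script
-- (properties as quoted in the post).
set pig.tmpfilecompression true;
set pig.tmpfilecompression.codec lzo;

-- Compress map output before the shuffle.
-- ASSUMPTION: Hadoop 1.x property names; newer releases use different names.
set mapred.compress.map.output true;
set mapred.map.output.compression.codec com.hadoop.compression.lzo.LzoCodec;
```

Comparing the wall-clock time of each of the 4 jobs with and without these lines, rather than only the total, would show whether any single stage is I/O bound.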
Kannan Shah 2012-11-18, 19:55