|
Sreenath Menon
2012-06-06, 08:50
Debarshi Basak
2012-06-06, 09:14
Debarshi Basak
2012-06-06, 09:15
Debarshi Basak
2012-06-06, 09:19
Siddharth Tiwari
2012-06-06, 09:28
Sreenath Menon
2012-06-06, 09:38
Sreenath Menon
2012-06-06, 09:40
Bejoy Ks
2012-06-06, 09:45
Bejoy Ks
2012-06-06, 09:48
Sreenath Menon
2012-06-06, 09:55
Bejoy Ks
2012-06-06, 10:06
Debarshi Basak
2012-06-06, 10:33
Vinod Singh
2012-06-06, 18:07
Mark Grover
2012-06-09, 00:08
Edward Capriolo
2012-06-09, 00:54
Raja Thiruvathuru
2012-06-09, 03:42
Sreenath Menon
2012-06-09, 04:20
Denny Lee
2012-06-09, 04:22
Sreenath Menon
2012-06-09, 04:28
|
-
Compressed data storage in HDFS - Error
Sreenath Menon 2012-06-06, 08:50
I would like to compress my data in HDFS using Hive commands.

Steps followed (data already resides in table sample):

    create table rc_lzo like sample;
    SET hive.exec.compress.output=true;
    SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzoCodec;
    insert overwrite table rc_lzo select * from sample;

Error:

    Compression codec com.hadoop.compression.lzo.LzoCodec was not found

1) What do I need to do to use LZO as well as other compression methods?

2) I heard somewhere that using compressed data will produce better results than uncompressed data in some cases. How can this be, given that compression methods always add compression and decompression time? Is there any truth in this, and if so, how? I can understand how there are better results when using compression between mappers and reducers and in between map-reduce jobs.

Thanks and Regards
Sreenath Mullassery
-
Re: Compressed data storage in HDFS - Error
Debarshi Basak 2012-06-06, 09:14
LZO doesn't ship with Apache Hadoop; you need to build it. Try gzip (GZ) instead.

Debarshi Basak
Tata Consultancy Services
Mailto: [EMAIL PROTECTED]
Website: http://www.tcs.com
Experience certainty. IT Services, Business Solutions, Outsourcing

Notice: The information contained in this e-mail message and/or attachments to it may contain confidential or privileged information. If you are not the intended recipient, any dissemination, use, review, distribution, printing or copying of the information contained in this e-mail message and/or attachments to it are strictly prohibited. If you have received this communication in error, please notify us by reply e-mail or telephone and immediately and permanently delete the message and any attachments. Thank you
-
Re: Compressed data storage in HDFS - Error
Debarshi Basak 2012-06-06, 09:15
Yes, performance is better because there is less I/O when there is less data to read.

Debarshi Basak
-
Re: Compressed data storage in HDFS - Error
Debarshi Basak 2012-06-06, 09:19
Basically, when your data is compressed you have less I/O than with uncompressed data. During job execution it doesn't decompress. This question would be more relevant on Hadoop's mailing list than on Hive's.

Debarshi Basak
-
RE: Compressed data storage in HDFS - Error
Siddharth Tiwari 2012-06-06, 09:28
There is something you gain and something you lose.
Compression reduces I/O at the cost of more CPU work. You will also see different effects for different tasks, i.e. HDFS read, HDFS write, shuffle and sort. So whether to go for compression depends on your usage.
-
Re: Compressed data storage in HDFS - Error
Sreenath Menon 2012-06-06, 09:38
Thanks for the response.

1) How do I use Gz compression, and does it come with Hadoop? If not, how do I build a compression method for use in Hive? I would like to run an evaluation across compression methods. What is the default compression used in Hadoop?

2) Kindly bear with me if this question is stupid. I am not talking about compression within intermediate steps. How can storing the raw data in compressed format be useful, since the data needs to be decompressed to execute a job... right?
-
Re: Compressed data storage in HDFS - Error
Sreenath Menon 2012-06-06, 09:40
OK, understood. So you load the compressed data into memory (thereby decreasing the size of the file that needs to be loaded) and then apply the decompression algorithm to get the uncompressed data. Is this what happens?
-
Re: Compressed data storage in HDFS - Error
Bejoy Ks 2012-06-06, 09:45
Hi Sreenath

The LZO error occurs because you don't have the LZO libraries in the Hadoop_Home/lib/native folder. You need to pack/build LZO for the OS you are using.

With compression, as you mentioned, there is an overhead in decompressing while processing the records. HDFS is used to store large amounts of data, so compression saves a lot of storage space (consider replication as well).

Now, it is not final output compression that speeds up map reduce jobs; it is intermediate compression that has this advantage. Intermediate compression means compression of map output. In a map reduce job there is a lot of copying and shuffling between the map and reduce phases; when this intermediate data is compressed, this operation is faster as it consumes much less I/O.

The following properties enable intermediate compression:

    mapred.compress.map.output=true
    mapred.map.output.compression.codec=com.hadoop.compression.lzo.LzoCodec

Regards
Bejoy KS
-
Re: Compressed data storage in HDFS - Error
Bejoy Ks 2012-06-06, 09:48
Hi Sreenath

The default compression codec used in Hadoop is org.apache.hadoop.io.compress.DefaultCodec.

To use gzip as compression:

    mapred.output.compress=true
    mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec

Regards
Bejoy KS
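Applied to the Hive session from the original post, the gzip settings might look like the sketch below. It assumes a Hadoop 1.x-era Hive where the mapred.* property names are in effect, and the table name rc_gzip is hypothetical; only the property names and the codec class come from the thread.

```sql
-- Hypothetical target table; `sample` is the source table from the original post.
CREATE TABLE rc_gzip LIKE sample;

-- Ask Hive to compress the final output of the query.
SET hive.exec.compress.output=true;
-- GzipCodec is bundled with Hadoop, unlike the LZO codec.
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

-- Rewrite the existing data into the compressed table.
INSERT OVERWRITE TABLE rc_gzip SELECT * FROM sample;
```

Swapping the codec class is then the only change needed to compare compression methods, provided each codec is actually installed on the cluster.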
-
Re: Compressed data storage in HDFS - Error
Sreenath Menon 2012-06-06, 09:55
Hi Bejoy
I would like to make this clear. There is no gain in processing throughput/time from compressing the data stored in HDFS (not talking about intermediate compression)... right? And do I need to add the LZO libraries in Hadoop_Home/lib/native on all the nodes (including the slave nodes)?
-
Re: Compressed data storage in HDFS - Error
Bejoy Ks 2012-06-06, 10:06
Hi Sreenath

Output compression is more useful at the storage level: when a larger file is compressed it saves on HDFS blocks, and thereby the cluster becomes more scalable in terms of number of files.

Yes, the LZO libraries need to be there on all task tracker nodes as well as on the node that hosts the Hive client.

Regards
Bejoy KS
-
Re: Compressed data storage in HDFS - Error
Debarshi Basak 2012-06-06, 10:33
Compression is an overhead when you have a CPU intensive job.

Debarshi Basak
-
Re: Compressed data storage in HDFS - Error
Vinod Singh 2012-06-06, 18:07
But it may pay off by saving on network I/O while copying the data during the reduce phase, though it will vary from case to case. We had good results using the Snappy codec for compressing map output. Snappy provides reasonably good compression at a faster rate.

Thanks,
Vinod
http://blog.vinodsingh.com/
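What Vinod describes could be set up in a Hive session roughly as follows. This is a sketch under assumptions: it presumes the class org.apache.hadoop.io.compress.SnappyCodec and its native library are available on the cluster, which depends on the Hadoop version and build, and it uses the Hadoop 1.x-era property names seen elsewhere in this thread together with Hive's hive.exec.compress.intermediate.

```sql
-- Compress the files Hive writes between the jobs of a multi-stage query.
SET hive.exec.compress.intermediate=true;
-- Compress map output before it is shuffled to the reducers.
SET mapred.compress.map.output=true;
-- Assumption: Snappy support is installed on every task tracker node.
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```

These settings only affect data in flight between map and reduce, so they can be combined with a different codec for the final output.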
-
Re: Compressed data storage in HDFS - Error
Mark Grover 2012-06-09, 00:08
Hi Sreenath,

All the points made on this thread are very valid. However, I wanted to add that you should keep in mind that Gzip compression is not splittable. This is because of the very nature of the codec. So, if your input data contains Gzip files larger than the HDFS block size, Hadoop won't be able to split these files and each entire file will be sent to a single mapper. This reduces the performance of the job.

As Vinod mentioned, Snappy is getting some traction. Definitely worth a shot!

Good luck!
Mark
-
Re: Compressed data storage in HDFS - Error
Edward Capriolo 2012-06-09, 00:54
Compression will make processing faster almost all the time. On average, gzip compression can shrink a text file to 40 percent of its original size; Snappy maybe to about 60 percent. So if you are dealing with, say, 1 TB of data, a 60 percent saving is 600 GB. If you think about the disk and network savings, they will eclipse any CPU waste.

My advice: use Snappy for intermediate compression and gzip for final output.
-
Re: Compressed data storage in HDFS - Error
Raja Thiruvathuru 2012-06-09, 03:42
Agree with Mark.

--
Raja Thiruvathuru
-
Re: Compressed data storage in HDFS - Error Sreenath Menon 2012-06-09, 04:20
Any idea about LZO or bzip2? Are either of these splittable?
-
Re: Compressed data storage in HDFS - Error Denny Lee 2012-06-09, 04:22
Out of curiosity, why not bz2, which is splittable? I will definitely try out Snappy in the meantime. Thanks!
@dennylee | http://about.me/dennylee
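The splittability point raised in this thread can be illustrated with a rough sketch. The helper below is hypothetical and not part of Hadoop. It assumes one input split per HDFS block and uses only the codec properties stated in the thread (gzip is not splittable, bz2 is). Real split counts also depend on the input format and split-size settings, so treat this as an approximation.

```python
# Illustrative only: a hypothetical helper, not a Hadoop API.
# Splittability flags reflect what this thread states: gzip files cannot be
# split, bzip2 files can. Uncompressed data is included for comparison.
SPLITTABLE = {"none": True, "bzip2": True, "gzip": False}


def estimated_map_tasks(file_size_bytes, block_size_bytes, codec):
    """Estimate map tasks for one file, assuming one split per HDFS block."""
    if file_size_bytes <= 0 or block_size_bytes <= 0:
        raise ValueError("sizes must be positive")
    if not SPLITTABLE[codec]:
        # A non-splittable file is handed to a single mapper as a whole.
        return 1
    # Ceiling division without floating point.
    return -(-file_size_bytes // block_size_bytes)


if __name__ == "__main__":
    one_gib = 1024 ** 3
    block = 64 * 1024 ** 2  # assumed 64 MiB block size
    for codec in ("none", "bzip2", "gzip"):
        print(codec, estimated_map_tasks(one_gib, block, codec))
```

Under these assumptions, a 1 GiB file with 64 MiB blocks yields 16 map tasks when uncompressed or bzip2-compressed, but only 1 when gzip-compressed, which is the loss of parallelism described above.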
-
Re: Compressed data storage in HDFS - Error Sreenath Menon 2012-06-09, 04:28
OK, I am getting a little confused now.
Consider a scenario where there is no limit on the memory available. In such a scenario, is there any advantage to storing data in HDFS in compressed format? For example, if node 1 holds the data and is busy executing a task while node 2 is free, then the data needs to be transferred from node 1 to node 2, right? Is there any network advantage, or any other advantage, to storing the data in HDFS in compressed format? I am not talking about compression in the intermediate steps (such as between mappers and reducers, or between MapReduce jobs), but about compression of the data stored in HDFS, which has to be decompressed for processing and therefore adds processing-time overhead.
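One way to reason about this question is a back-of-the-envelope cost model. The sketch below is a simplification with hypothetical parameter names. It assumes I/O (disk or network) and decompression happen one after the other and ignores everything else a job does. It is not a measurement of any real cluster.

```python
# Illustrative only: a hypothetical cost model, not a benchmark.
def read_time_seconds(raw_bytes, io_bytes_per_s,
                      compression_ratio=1.0, decompress_bytes_per_s=None):
    """Time to move and decode one input of raw_bytes.

    compression_ratio is compressed size / raw size (1.0 = uncompressed).
    decompress_bytes_per_s is the rate at which raw bytes are produced by
    the decompressor; None means no decompression step.
    """
    io_time = (raw_bytes * compression_ratio) / io_bytes_per_s
    cpu_time = 0.0
    if decompress_bytes_per_s is not None:
        cpu_time = raw_bytes / decompress_bytes_per_s
    return io_time + cpu_time


if __name__ == "__main__":
    # Arbitrary example values, chosen only to show both outcomes.
    plain = read_time_seconds(1000, 100)
    fast_codec = read_time_seconds(1000, 100, 0.5, 1000)
    slow_codec = read_time_seconds(1000, 100, 0.5, 100)
    print(plain, fast_codec, slow_codec)
```

In this model, compressed storage wins whenever the I/O time saved by moving fewer bytes exceeds the time spent decompressing. That matches the replies in the thread: it can pay off when a job is bound by disk or network I/O, and it is an overhead when the job is already CPU-bound.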