Hive >> mail # user >> Compressed data storage in HDFS - Error


Re: Compressed data storage in HDFS - Error
Hi Sreenath,
All the points made on this thread are very valid. However, I wanted to add
that Gzip compression is not splittable; this is inherent to the codec. So
if your input contains Gzip files larger than the HDFS block size, Hadoop
cannot split them, and each entire file is sent to a single mapper, which
reduces the performance of the job.
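A common workaround for the splittability problem is to store the data as a
SequenceFile with block-level compression, which remains splittable even when
the codec itself is not. A sketch of the Hive side (the table names are
hypothetical, and the property names are the older mapred.* ones — verify
against your Hadoop version):

```sql
-- Hedged sketch: SequenceFiles are splittable even when the codec
-- itself (Gzip here) is not; uses the older mapred.* property names.
SET hive.exec.compress.output=true;
SET mapred.output.compression.type=BLOCK;
SET mapred.output.compression.codec=org.apache.hadoop.io.compress.GzipCodec;

CREATE TABLE page_views_seq
STORED AS SEQUENCEFILE
AS SELECT * FROM page_views;  -- page_views is a hypothetical source table
```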

As Vinod mentioned, Snappy is getting some traction. Definitely worth a
shot!
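The map-output compression Vinod describes is typically enabled with settings
along these lines (a sketch using the older mapred.* property names; newer
Hadoop releases use mapreduce.* equivalents):

```sql
-- Hedged sketch: compress intermediate map output with Snappy.
-- Older mapred.* property names; check your Hadoop version's docs.
SET mapred.compress.map.output=true;
SET mapred.map.output.compression.codec=org.apache.hadoop.io.compress.SnappyCodec;
```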

Good luck!
Mark

On Wed, Jun 6, 2012 at 2:07 PM, Vinod Singh <[EMAIL PROTECTED]> wrote:

> But it may payoff by saving on network IO while copying the data during
> reduce phase. Though it will vary from case to case. We had good results by
> using Snappy codec for compressing map output. Snappy provides reasonably
> good compression at faster rate.
>
> Thanks,
> Vinod
>
> http://blog.vinodsingh.com/
>
>
> On Wed, Jun 6, 2012 at 4:03 PM, Debarshi Basak <[EMAIL PROTECTED]>wrote:
>
>>  Compression is an overhead when you have a CPU-intensive job.
>>
>>
>> Debarshi Basak
>> Tata Consultancy Services
>> Mailto: [EMAIL PROTECTED]
>> Website: http://www.tcs.com
>> ____________________________________________
>> Experience certainty. IT Services
>> Business Solutions
>> Outsourcing
>> ____________________________________________
>>
>> -----Bejoy Ks wrote:-----
>>
>> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>> From: Bejoy Ks <[EMAIL PROTECTED]>
>> Date: 06/06/2012 03:37PM
>> Subject: Re: Compressed data storage in HDFS - Error
>>
>>
>> Hi Sreenath
>>
>> Output compression is more useful at the storage level: when a large file
>> is compressed it occupies fewer HDFS blocks, and thereby the cluster
>> becomes more scalable in terms of the number of files.
>>
>> Yes, the LZO libraries need to be present on all task tracker nodes as
>> well as on the node that hosts the Hive client.
>>
>> Regards
>> Bejoy KS
>>
>> ------------------------------
>> From: Sreenath Menon <[EMAIL PROTECTED]>
>> To: [EMAIL PROTECTED]; Bejoy Ks <[EMAIL PROTECTED]>
>> Sent: Wednesday, June 6, 2012 3:25 PM
>> Subject: Re: Compressed data storage in HDFS - Error
>>
>> Hi Bejoy
>> I would like to make this clear.
>> There is no gain in processing throughput/time from compressing the data
>> stored in HDFS (not talking about intermediate compression)...right?
>> And do I need to add the lzo libraries in Hadoop_Home/lib/native for all
>> the nodes (including the slave nodes)??
>>
>>
>>
>>
>
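The LZO output-compression setup Bejoy and Sreenath discuss above would, as a
sketch, look roughly like this on the Hive side (assuming the hadoop-lzo
codec and native libraries are already installed on every node, as noted in
the thread):

```sql
-- Hedged sketch: enable LZO-compressed job output from Hive.
-- Requires the hadoop-lzo codec on all task tracker nodes and the Hive client.
SET hive.exec.compress.output=true;
SET mapred.output.compression.codec=com.hadoop.compression.lzo.LzopCodec;
```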