HDFS, mail # user - Now give .gz file as input to the MAP


Re: Now give .gz file as input to the MAP
Rahul Bhattacharjee 2013-06-12, 17:47
Yeah, I too found that quite slow and memory hungry!

Thanks,
Rahul-da
On Wed, Jun 12, 2013 at 11:13 PM, Sanjay Subramanian <
[EMAIL PROTECTED]> wrote:

>  Rahul-da
>
>  I found bz2 pretty slow (although splittable), so I switched to Snappy
> (only sequence files are splittable, but compression and decompression are fast).
>
>  Thanks
> Sanjay
>
>   From: Rahul Bhattacharjee <[EMAIL PROTECTED]>
> Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Date: Tuesday, June 11, 2013 9:53 PM
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Subject: Re: Now give .gz file as input to the MAP
>
>   Nothing special is required to process .gz files using MR. However,
> as Sanjay mentioned, verify the codecs configured in core-site. Another
> thing to note is that these files are not splittable.
>
>  You might want to use bz2; those files are splittable.
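
A minimal sketch of why a .gz file cannot be split, using only the JDK's java.util.zip classes rather than Hadoop's GzipCodec. The names GzipSplitDemo, gzip and gunzipFrom are made up for this illustration. A gzip stream carries its header once, at the start, so a reader that begins at an arbitrary byte offset (as a second mapper would on a split file) finds nothing it can decode:

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;
import java.util.zip.GZIPOutputStream;

// Hypothetical demo class; it does not use or mimic any Hadoop API.
final class GzipSplitDemo {

    /** Compresses the text into one in-memory gzip stream. */
    static byte[] gzip(String text) {
        ByteArrayOutputStream buffer = new ByteArrayOutputStream();
        try (GZIPOutputStream out = new GZIPOutputStream(buffer)) {
            out.write(text.getBytes(StandardCharsets.UTF_8));
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buffer.toByteArray();
    }

    /**
     * Tries to decompress starting at the given byte offset.
     * Returns the decoded text, or null if the bytes from that offset
     * onward cannot be read as a gzip stream.
     */
    static String gunzipFrom(byte[] data, int offset) {
        try (GZIPInputStream in = new GZIPInputStream(
                new ByteArrayInputStream(data, offset, data.length - offset))) {
            // readAllBytes() needs Java 9 or later.
            return new String(in.readAllBytes(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            return null; // missing header or corrupt data from this offset
        }
    }

    public static void main(String[] args) {
        StringBuilder text = new StringBuilder();
        for (int i = 0; i < 1000; i++) {
            text.append("log line ").append(i).append('\n');
        }
        byte[] data = gzip(text.toString());

        boolean fromStart = gunzipFrom(data, 0) != null;
        boolean fromMiddle = gunzipFrom(data, data.length / 2) != null;
        System.out.println("readable from offset 0:      " + fromStart);
        System.out.println("readable from the midpoint:  " + fromMiddle);
    }
}
```

This is why a single .gz input file ends up in one map task no matter how large it is, whereas bz2 is block-oriented and can be picked up mid-file.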
>
> Thanks,
> Rahul
>
>
> On Wed, Jun 12, 2013 at 10:14 AM, Sanjay Subramanian <
> [EMAIL PROTECTED]> wrote:
>
>>  hadoopConf.set("mapreduce.job.inputformat.class",
>> "com.wizecommerce.utils.mapred.TextInputFormat");
>>
>> hadoopConf.set("mapreduce.job.outputformat.class",
>> "com.wizecommerce.utils.mapred.TextOutputFormat");
>>  No special settings are required for reading Gzip except the ones above.
>>
>>  If you want to output Gzip:
>>
>>  hadoopConf.set("mapreduce.output.fileoutputformat.compress", "true");
>>
>> hadoopConf.set("mapreduce.output.fileoutputformat.compress.codec",
>> "org.apache.hadoop.io.compress.GzipCodec");
>>
>> Make sure the Gzip codec is defined in core-site.xml:
>>  <!-- core-site.xml -->
>>  <property>
>>      <name>io.compression.codecs</name>
>>      <value>org.apache.hadoop.io.compress.GzipCodec,org.apache.hadoop.io.compress.DefaultCodec</value>
>>  </property>
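
A rough sketch of why the codec list matters for reading input. As far as I know, Hadoop's CompressionCodecFactory picks a codec for each input file by matching the file name's suffix against the extensions of the configured codecs (".gz" for GzipCodec, ".deflate" for DefaultCodec). The class below is a hypothetical stand-in written with plain JDK collections to show that lookup idea; it is not Hadoop's implementation:

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Optional;

// Hypothetical stand-in for suffix-based codec lookup; not a Hadoop class.
final class CodecByExtension {

    // Extension -> codec class name, mirroring the two codecs listed
    // in the io.compression.codecs property above.
    private static final Map<String, String> CODECS = new LinkedHashMap<>();
    static {
        CODECS.put(".gz", "org.apache.hadoop.io.compress.GzipCodec");
        CODECS.put(".deflate", "org.apache.hadoop.io.compress.DefaultCodec");
    }

    /** Returns the codec class name whose extension ends the file name, if any. */
    static Optional<String> codecFor(String fileName) {
        for (Map.Entry<String, String> entry : CODECS.entrySet()) {
            if (fileName.endsWith(entry.getKey())) {
                return Optional.of(entry.getValue());
            }
        }
        // No match: the file would be treated as uncompressed text.
        return Optional.empty();
    }

    public static void main(String[] args) {
        System.out.println(codecFor("access-2013-06-11.log.gz").orElse("(no codec, read as plain text)"));
        System.out.println(codecFor("access-2013-06-11.log").orElse("(no codec, read as plain text)"));
    }
}
```

If the Gzip codec were missing from the configured list, a ".gz" input would fall through to the "no codec" branch in this model and the mapper would see raw compressed bytes.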
>>
>>  I have a question
>>
>>  Why are you using Gzip as input to the map? Gzip files are not splittable,
>> which only helps if you have to read multi-line records (like the lines between
>> a BEGIN and END block in a log file) and send each one as a single record to the mapper.
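
A small sketch of the multi-line case mentioned above, assuming the markers are lines that contain exactly BEGIN and END (the thread does not say what the real log format looks like). BlockRecordGrouper is a hypothetical helper in plain Java, not a Hadoop RecordReader; it only shows the grouping logic that is easy when one reader sees the whole file and awkward when a block can straddle a split boundary:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper; the BEGIN/END marker format is an assumption.
final class BlockRecordGrouper {

    /**
     * Joins the lines between each BEGIN line and the next END line into
     * one newline-separated record. Lines outside a block are ignored,
     * and a block that never reaches END is dropped.
     */
    static List<String> group(List<String> lines) {
        List<String> records = new ArrayList<>();
        StringBuilder current = null; // non-null only while inside a block
        for (String line : lines) {
            String trimmed = line.trim();
            if (trimmed.equals("BEGIN")) {
                current = new StringBuilder();
            } else if (trimmed.equals("END")) {
                if (current != null) {
                    records.add(current.toString());
                    current = null;
                }
            } else if (current != null) {
                if (current.length() > 0) {
                    current.append('\n');
                }
                current.append(line);
            }
        }
        return records;
    }

    public static void main(String[] args) {
        List<String> log = Arrays.asList(
                "BEGIN", "user=1", "action=login", "END",
                "stray line",
                "BEGIN", "user=2", "END");
        for (String record : group(log)) {
            System.out.println("record: " + record.replace("\n", " | "));
        }
    }
}
```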
>>
>>  Also, among non-splittable codecs, Snappy is better.
>>
>>  Good Luck
>>
>>
>>  sanjay
>>
>>   From: samir das mohapatra <[EMAIL PROTECTED]>
>> Reply-To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>> Date: Tuesday, June 11, 2013 9:07 PM
>> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>, "
>> [EMAIL PROTECTED]" <[EMAIL PROTECTED]>, "
>> [EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>> Subject: Now give .gz file as input to the MAP
>>
>>   Hi All,
>>     Has anyone worked on how to pass a .gz file as input to a
>> MapReduce job?
>>
>> Regards,
>> samir.
>>
>> CONFIDENTIALITY NOTICE
>> =====================
>> This email message and any attachments are for the exclusive use of the
>> intended recipient(s) and may contain confidential and privileged
>> information. Any unauthorized review, use, disclosure or distribution is
>> prohibited. If you are not the intended recipient, please contact the
>> sender by reply email and destroy all copies of the original message along
>> with any attachments, from your computer system. If you are the intended
>> recipient, please be advised that the content of this message is subject to
>> access, review and disclosure by the sender's Email System Administrator.
>>
>
>
>