Hadoop >> mail # user >> Input split for a streaming job!


Re: Input split for a streaming job!
Hi Raj
       AFAIK 0.21 is an unstable release, and I doubt anyone would recommend it for production. You can play around with it, but a better approach would be to patch your CDH3u1 with the required patches for splittable BZip2; just make sure the new patch doesn't break anything else.
 
Regards
Bejoy K S
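[Editor's note: for context on why a splittable-BZip2 patch is possible at all, each compressed bzip2 block starts with a fixed 48-bit magic number (0x314159265359), so a reader that opens the file mid-stream can resynchronize at the next block boundary. The following is a minimal, self-contained sketch of that idea, not Hadoop's actual implementation; class and method names are hypothetical.]

```java
// Hypothetical sketch: why bzip2 lends itself to splitting.
// Each bzip2 compressed block begins with the 48-bit magic 0x314159265359,
// so a split reader can scan forward to the next block boundary.
public class Bzip2MagicScan {
    static final long BLOCK_MAGIC = 0x314159265359L;

    // Return the offset of the first block magic at or after 'start',
    // or -1 if none is found. This is a byte-aligned scan only; real
    // bzip2 blocks may begin at arbitrary bit offsets, which Hadoop's
    // splitting support has to handle.
    static int nextBlockStart(byte[] data, int start) {
        for (int i = start; i + 6 <= data.length; i++) {
            long v = 0;
            for (int j = 0; j < 6; j++) {
                v = (v << 8) | (data[i + j] & 0xFF);
            }
            if (v == BLOCK_MAGIC) return i;
        }
        return -1;
    }

    public static void main(String[] args) {
        // Fake buffer: one junk byte, then a block magic, then junk.
        byte[] fake = { 0x00, 0x31, 0x41, 0x59, 0x26, 0x53, 0x59, 0x7F };
        System.out.println(nextBlockStart(fake, 0)); // prints 1
    }
}
```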

-----Original Message-----
From: Raj V <[EMAIL PROTECTED]>
Date: Fri, 11 Nov 2011 10:34:18
To: Tim Broberg<[EMAIL PROTECTED]>; [EMAIL PROTECTED]<[EMAIL PROTECTED]>
Reply-To: [EMAIL PROTECTED]
Subject: Re: Input split for a streaming job!

Tim

I am using CDH3u1 (0.20.2+923), which does not have the patch.

I will try and use 0.21

Raj

>________________________________
>From: Tim Broberg <[EMAIL PROTECTED]>
>To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>; Raj V <[EMAIL PROTECTED]>; Joey Echeverria <[EMAIL PROTECTED]>
>Sent: Friday, November 11, 2011 10:25 AM
>Subject: RE: Input split for a streaming job!
>
>
>
>What version of hadoop are you using?

>We just stumbled on the Jira item for BZIP2 splitting, and it appears to have been added in 0.21.

>When I diff 0.20.205 vs trunk, I see
>
>< public class BZip2Codec implements
><     org.apache.hadoop.io.compress.CompressionCodec {
>---
>> @InterfaceAudience.Public
>> @InterfaceStability.Evolving
>> public class BZip2Codec implements SplittableCompressionCodec {
>
>So, it appears you need at least 0.21 to play with splittability in BZIP2.
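[Editor's note: the practical effect of that interface change is in how input splits are counted. With a splittable codec, FileInputFormat can cut a file into roughly one split per split-size worth of data; with a non-splittable codec it falls back to one split (one mapper) per file. A rough, self-contained sketch of that arithmetic, assuming a simplified model rather than Hadoop's actual getSplits() logic (which also applies a slop factor and honors minimum split sizes):]

```java
// Simplified model of split counting for compressed input files.
// Not Hadoop source; the helper below is hypothetical.
public class SplitCountSketch {
    // A non-splittable codec (e.g. gzip, or bzip2 before 0.21) forces a
    // single split per file; a splittable one allows ceil(size / splitSize).
    static long splits(long fileBytes, long splitBytes, boolean splittable) {
        if (!splittable) return 1;
        return (fileBytes + splitBytes - 1) / splitBytes; // ceiling division
    }

    public static void main(String[] args) {
        long mb = 1024L * 1024L;
        // A 256 MB file with a 32 MB split size:
        System.out.println(splits(256 * mb, 32 * mb, true));  // prints 8
        System.out.println(splits(256 * mb, 32 * mb, false)); // prints 1
    }
}
```

Under this model, seeing one mapper per input file regardless of block size is the signature of a codec the framework treats as non-splittable.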

>     - Tim.
>
>________________________________________
>From: Raj V [[EMAIL PROTECTED]]
>Sent: Friday, November 11, 2011 9:18 AM
>To: Joey Echeverria
>Cc: [EMAIL PROTECTED]
>Subject: Re: Input split for a streaming job!
>
>Joey, Anirudh, Bejoy
>
>I am using TextInputFormat Class. (org.apache.hadoop.mapred.TextInputFormat).
>
>And the input files were created using 32MB block size and the files are bzip2.
>
>So all things point to my input files being splittable.
>
>I  will continue poking around.
>
>- best regards
>
>Raj
>
>
>
>>________________________________
>>From: Joey Echeverria <[EMAIL PROTECTED]>
>>To: Raj V <[EMAIL PROTECTED]>
>>Sent: Friday, November 11, 2011 2:56 AM
>>Subject: Re: Input split for a streaming job!
>>
>>U1 should be able to split the bzip2 files. What input format are you using?
>>
>>-Joey
>>
>>On Thu, Nov 10, 2011 at 9:06 PM, Raj V <[EMAIL PROTECTED]> wrote:
>>> Sorry to bother you offline.
>>> From the release notes for CDH3U1
>>> ( http://archive.cloudera.com/cdh/3/hadoop-0.20.2+923.97.releasenotes.html)
>>> I understand that split of the bzip files was available.
>>> But returning to my old problem, I still see 73 mappers. Did I misunderstand
>>> something?
>>> If necessary, I can re-post the mail to the group.
>>>
>>> ________________________________
>>> From: Joey Echeverria <[EMAIL PROTECTED]>
>>> To: [EMAIL PROTECTED]
>>> Sent: Thursday, November 10, 2011 3:11 PM
>>> Subject: Re: Input split for a streaming job!
>>>
>>> No problem. Out of curiosity, why are you still using B3?
>>>
>>> -Joey
>>>
>>> On Thu, Nov 10, 2011 at 6:07 PM, Raj V <[EMAIL PROTECTED]> wrote:
>>>> Joey
>>>> I think I know the answer. I am using CDH3B3 (0.20.2+737), and this does
>>>> not seem to support bzip2 splitting. I should have looked before shooting
>>>> off the email :-(
>>>> To answer your second question, I created a completely new set of input
>>>> files with dfs.block.size=32MB and used this as the input data
>>>> Raj
>>>>
>>>>
>>>> ________________________________
>>>> From: Joey Echeverria <[EMAIL PROTECTED]>
>>>> To: [EMAIL PROTECTED]
>>>> Sent: Thursday, November 10, 2011 3:02 PM
>>>> Subject: Re: Input split for a streaming job!
>>>>
>>>> It depends on the version of hadoop that you're using. Also, when you
>>>> changed the block size, did you do it on the actual files, or just the
>>>> default for new files?
>>>>
>>>> -Joey
>>>>
>>>> On Thu, Nov 10, 2011 at 5:52 PM, Raj V <[EMAIL PROTECTED]> wrote:
>>>>> Hi Joey,