Re: structured data split
Thanks Harsh!...

2011/11/11 Harsh J <[EMAIL PROTECTED]>

> Sorry Bejoy, I'd typed that URL out from memory.
> Fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce
>
> 2011/11/11 Bejoy KS <[EMAIL PROTECTED]>:
> > Thanks Harsh for correcting me with that wonderful piece of information.
> > Cleared a wrong assumption on hdfs storage fundamentals today.
> >
> > Sorry Donal for confusing you over the same.
> >
> > Harsh,
> > Looks like the link is broken; it'd be great if you could post the
> > URL once more.
> >
> > Thanks a lot
> >
> > Regards
> > Bejoy.K.S
> >
> > On Fri, Nov 11, 2011 at 7:24 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> Bejoy,
> >> This is incorrect. As Denny had explained earlier, blocks are split
> >> along byte sizes alone. The writer does not concern itself with
> >> newlines and such. When reading, the record readers align themselves
> >> to read till the end of lines by communicating with the next block
> >> if they have to.
> >> This is explained neatly under
> >> http://wiki.apache.org/Hadoop/MapReduceArch, para 2 of Map.
> >> Regarding structured data, such as XML, one can write a custom
> >> InputFormat that returns appropriate split points after scanning
> >> through the entire file pre-submit (say, by looking at tags).
> >> However, if you want XML, then there is already an XMLInputFormat
> >> available in Mahout. For reading N lines at a time, use
> >> NLineInputFormat.
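For completeness, wiring NLineInputFormat into a job driver looks roughly like this, assuming the newer org.apache.hadoop.mapreduce API; the class name NLineDriver and the figure of 1000 lines per split are arbitrary choices for the example.

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;

    public class NLineDriver {
        public static void main(String[] args) throws Exception {
            Job job = Job.getInstance(new Configuration(), "n-line example");
            job.setJarByClass(NLineDriver.class);
            // Each split, and hence each mapper, gets 1000 lines
            // (the last split may get fewer), regardless of block layout.
            job.setInputFormatClass(NLineInputFormat.class);
            NLineInputFormat.setNumLinesPerSplit(job, 1000);
            FileInputFormat.addInputPath(job, new Path(args[0]));
            // ... mapper, reducer and output settings go here as usual ...
            System.exit(job.waitForCompletion(true) ? 0 : 1);
        }
    }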
> >> On 11-Nov-2011, at 6:55 PM, [EMAIL PROTECTED] wrote:
> >>
> >> Donal
> >> In hadoop that hardly happens. When you are storing data in hdfs it
> >> would be split into blocks depending on end of lines, in case of
> >> normal files. It won't be like you'd be having half of a line in one
> >> block and the rest in the next one. You don't need to worry on that
> >> fact.
> >> The case you mentioned is like dependent data splits. Hadoop's
> >> massive parallel processing could be fully utilized only in case of
> >> independent data splits. When data splits are dependent on a file
> >> level, as I pointed out, you can go for WholeFileInputFormat.
> >>
> >> Please revert if you are still confused. Also, if you have some
> >> specific scenario, please put that across so we may be able to help
> >> you understand the map reduce processing of the same better.
> >>
> >> Hope it clarifies...
> >> Regards
> >> Bejoy K S
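A note on the WholeFileInputFormat Bejoy mentions: it is not a class bundled with Hadoop itself; it is a well-known example from Tom White's "Hadoop: The Definitive Guide". The essential trick is an input format whose isSplitable() returns false, so each whole file becomes a single split. A minimal sketch along those lines:

    import java.io.IOException;
    import org.apache.hadoop.fs.FSDataInputStream;
    import org.apache.hadoop.fs.FileSystem;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.io.BytesWritable;
    import org.apache.hadoop.io.IOUtils;
    import org.apache.hadoop.io.NullWritable;
    import org.apache.hadoop.mapreduce.InputSplit;
    import org.apache.hadoop.mapreduce.JobContext;
    import org.apache.hadoop.mapreduce.RecordReader;
    import org.apache.hadoop.mapreduce.TaskAttemptContext;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Minimal sketch: an input format that never splits a file, so each
    // file is processed whole by exactly one mapper.
    public class WholeFileInputFormat
            extends FileInputFormat<NullWritable, BytesWritable> {

        @Override
        protected boolean isSplitable(JobContext context, Path file) {
            return false; // one split per file => one mapper per file
        }

        @Override
        public RecordReader<NullWritable, BytesWritable> createRecordReader(
                InputSplit split, TaskAttemptContext context) {
            return new RecordReader<NullWritable, BytesWritable>() {
                private FileSplit fileSplit;
                private TaskAttemptContext ctx;
                private final BytesWritable value = new BytesWritable();
                private boolean processed = false;

                @Override
                public void initialize(InputSplit s, TaskAttemptContext c) {
                    fileSplit = (FileSplit) s;
                    ctx = c;
                }

                @Override
                public boolean nextKeyValue() throws IOException {
                    if (processed) return false;
                    // Read the entire file into a single record value.
                    byte[] contents = new byte[(int) fileSplit.getLength()];
                    Path file = fileSplit.getPath();
                    FileSystem fs = file.getFileSystem(ctx.getConfiguration());
                    try (FSDataInputStream in = fs.open(file)) {
                        IOUtils.readFully(in, contents, 0, contents.length);
                    }
                    value.set(contents, 0, contents.length);
                    processed = true;
                    return true;
                }

                @Override public NullWritable getCurrentKey() { return NullWritable.get(); }
                @Override public BytesWritable getCurrentValue() { return value; }
                @Override public float getProgress() { return processed ? 1f : 0f; }
                @Override public void close() {}
            };
        }
    }

Note the trade-off Bejoy describes below: with a single split there is at most one data-local block, so any remaining blocks travel over the network to the map task.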
> >> ________________________________
> >> From: 臧冬松 <[EMAIL PROTECTED]>
> >> Date: Fri, 11 Nov 2011 20:46:54 +0800
> >> To: <[EMAIL PROTECTED]>
> >> ReplyTo: [EMAIL PROTECTED]
> >> Subject: Re: structured data split
> >> Thanks Bejoy!
> >> It's better to process the data blocks locally and separately.
> >> I just want to know how to deal with a structure (i.e. a word, a
> >> line) that is split into two blocks.
> >>
> >> Cheers,
> >> Donal
> >>
> >> On 11 Nov 2011 at 7:01 PM, Bejoy KS <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Hi Donal
> >>> You can configure your map tasks the way you like to process your
> >>> input. If you have a file of size 100 MB, it would be divided into
> >>> two input blocks and stored in hdfs (if your dfs.block.size is the
> >>> default 64 MB). It is your choice how you process the same using
> >>> map reduce.
> >>> - With the default TextInputFormat, the two blocks would be
> >>> processed by two different mappers (under default split settings).
> >>> If the blocks are on two different data nodes then, in the best
> >>> case, two different mappers would be spawned, one on each data
> >>> node, i.e. they are data-local map tasks.
> >>> - If you want one mapper to process the whole file, change your
> >>> input format to WholeFileInputFormat. There a mapper task would be
> >>> triggered on any one of the nodes where the blocks are located
> >>> (best case). If both blocks are not on the same node then one of
> >>> the blocks would be transferred to the map task location for
> >>> processing.
> >>>
> >>> Hope it helps!...
> >
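To put numbers on Bejoy's 100 MB example, a back-of-the-envelope sketch (the class name SplitMath is invented for illustration):

    public class SplitMath {
        public static void main(String[] args) {
            // Bejoy's example: a 100 MB file with the old default
            // dfs.block.size of 64 MB.
            long fileSize  = 100L * 1024 * 1024;
            long blockSize =  64L * 1024 * 1024;
            long numBlocks = (fileSize + blockSize - 1) / blockSize; // ceiling
            System.out.println(numBlocks + " blocks"); // 2: 64 MB + 36 MB
            // With TextInputFormat and default split settings that means
            // two map tasks, ideally one on each datanode holding a block;
            // with a non-splittable format it means one map task in total.
        }
    }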