Re: structured data split
Thanks Harsh!...

2011/11/11 Harsh J <[EMAIL PROTECTED]>

> Sorry Bejoy, I'd typed that URL out from memory.
> Fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce
>
> 2011/11/11 Bejoy KS <[EMAIL PROTECTED]>:
> > Thanks Harsh for correcting me with that wonderful piece of information.
> > It cleared up a wrong assumption about HDFS storage fundamentals today.
> >
> > Sorry Donal for confusing you over the same.
> >
> > Harsh,
> >        Looks like the link is broken; it'd be great if you could post
> > the URL once more.
> >
> > Thanks a lot
> >
> > Regards
> > Bejoy.K.S
> >
> > On Fri, Nov 11, 2011 at 7:24 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >>
> >> Bejoy,
> >> This is incorrect. As Denny had explained earlier, blocks are split
> >> along byte sizes alone. The writer does not concern itself with
> >> newlines and such. When reading, the record readers align themselves
> >> to read till the end of lines by communicating with the next block if
> >> they have to.
> >> This is explained neatly under
> >> http://wiki.apache.org/Hadoop/MapReduceArch, para 2 of Map.
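> >>
> >> Roughly, the line reader for each split does something like this
> >> sketch (illustrative only, not the actual LineRecordReader source):
> >>
> >> import java.io.IOException;
> >> import org.apache.hadoop.fs.FSDataInputStream;
> >>
> >> class SplitAlignmentSketch {
> >>     // Each reader skips the partial first line of its split (unless
> >>     // the split starts at byte 0) and later reads past its own end
> >>     // to finish its last line, so no line is lost or read twice.
> >>     static void seekToFirstFullLine(FSDataInputStream in, long start)
> >>             throws IOException {
> >>         in.seek(start);
> >>         if (start != 0) {
> >>             int b;
> >>             // The previous split's reader owns this partial line.
> >>             while ((b = in.read()) != -1 && b != '\n') { }
> >>         }
> >>     }
> >> }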
> >> Regarding structured data, such as XML, one can write a custom
> >> InputFormat that returns appropriate split points after scanning
> >> through the entire file pre-submit (say, by looking at tags).
> >> However, if you want XML, then there is already an XMLInputFormat
> >> available in Mahout. For reading N lines at a time, use
> >> NLineInputFormat.
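> >>
> >> For example, with the new (mapreduce) API, an untested fragment like
> >> this should hand each map task N lines at a time:
> >>
> >> import org.apache.hadoop.conf.Configuration;
> >> import org.apache.hadoop.mapreduce.Job;
> >> import org.apache.hadoop.mapreduce.lib.input.NLineInputFormat;
> >>
> >> Configuration conf = new Configuration();
> >> Job job = new Job(conf, "n-lines-per-mapper");
> >> // Each map task gets 1000 lines instead of a whole block:
> >> job.setInputFormatClass(NLineInputFormat.class);
> >> NLineInputFormat.setNumLinesPerSplit(job, 1000);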
> >> On 11-Nov-2011, at 6:55 PM, [EMAIL PROTECTED] wrote:
> >>
> >> Donal
> >> In Hadoop that hardly happens. When you are storing data in HDFS, it
> >> would be split into blocks depending on end of lines, in the case of
> >> normal files. It won't be like you'd have half of a line in one block
> >> and the rest in the next one. You don't need to worry about that.
> >> The case you mentioned is like dependent data splits. Hadoop's
> >> massively parallel processing can be fully utilized only in the case
> >> of independent data splits. When data splits are dependent at a file
> >> level, as I pointed out, you can go for WholeFileInputFormat.
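> >>
> >> Note that WholeFileInputFormat is not a stock Hadoop class; the
> >> essential trick is just refusing to split the file. A minimal sketch
> >> (the record reader is omitted):
> >>
> >> import org.apache.hadoop.fs.Path;
> >> import org.apache.hadoop.io.BytesWritable;
> >> import org.apache.hadoop.io.NullWritable;
> >> import org.apache.hadoop.mapreduce.InputSplit;
> >> import org.apache.hadoop.mapreduce.JobContext;
> >> import org.apache.hadoop.mapreduce.RecordReader;
> >> import org.apache.hadoop.mapreduce.TaskAttemptContext;
> >> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> >>
> >> public class WholeFileInputFormat
> >>         extends FileInputFormat<NullWritable, BytesWritable> {
> >>     @Override
> >>     protected boolean isSplitable(JobContext context, Path file) {
> >>         return false; // one file -> one split -> one map task
> >>     }
> >>
> >>     @Override
> >>     public RecordReader<NullWritable, BytesWritable> createRecordReader(
> >>             InputSplit split, TaskAttemptContext context) {
> >>         // A real reader would emit the whole file as one
> >>         // (NullWritable, BytesWritable) record; omitted here.
> >>         throw new UnsupportedOperationException("sketch only");
> >>     }
> >> }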
> >>
> >> Please revert if you are still confused. Also, if you have a specific
> >> scenario, please put that across so we may be able to help you better
> >> understand the MapReduce processing of it.
> >>
> >> Hope it clarifies...
> >> Regards
> >> Bejoy K S
> >> ________________________________
> >> From: 臧冬松 <[EMAIL PROTECTED]>
> >> Date: Fri, 11 Nov 2011 20:46:54 +0800
> >> To: <[EMAIL PROTECTED]>
> >> ReplyTo: [EMAIL PROTECTED]
> >> Subject: Re: structured data split
> >> Thanks Bejoy!
> >> It's better to process the data blocks locally and separately.
> >> I just want to know how to deal with a structure (i.e. a word, a
> >> line) that is split into two blocks.
> >>
> >> Cheers,
> >> Donal
> >>
> >> On Nov 11, 2011, at 7:01 PM, Bejoy KS <[EMAIL PROTECTED]> wrote:
> >>>
> >>> Hi Donal
> >>>       You can configure your map tasks the way you like to process
> >>> your input. If you have a file of size 100 MB, it would be divided
> >>> into two input blocks and stored in HDFS (if your dfs.block.size is
> >>> the default 64 MB). It is your choice how you process the same using
> >>> map reduce:
> >>> - With the default TextInputFormat, the two blocks would be
> >>> processed by two different mappers (under default split settings).
> >>> If the blocks are on two different data nodes, then two different
> >>> mappers would be spawned, one on each data node, in the best case,
> >>> i.e. they are data-local map tasks (see the driver sketch below).
> >>> - If you want one mapper to process the whole file, change your
> >>> input format to WholeFileInputFormat. Then a mapper task would be
> >>> triggered on any one of the nodes where the blocks are located (best
> >>> case). If both blocks are not on the same node, then one of the
> >>> blocks would be transferred to the map task location for processing.
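> >>>
> >>> A small driver fragment for the default case (untested; the input
> >>> path is just a placeholder):
> >>>
> >>> import org.apache.hadoop.conf.Configuration;
> >>> import org.apache.hadoop.fs.Path;
> >>> import org.apache.hadoop.mapreduce.Job;
> >>> import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
> >>> import org.apache.hadoop.mapreduce.lib.input.TextInputFormat;
> >>>
> >>> Configuration conf = new Configuration();
> >>> Job job = new Job(conf, "default-splits");
> >>> // A 100 MB file with dfs.block.size = 64 MB yields two splits
> >>> // (64 MB + 36 MB), hence two map tasks, data-local in the best case.
> >>> job.setInputFormatClass(TextInputFormat.class);
> >>> FileInputFormat.addInputPath(job, new Path("/user/donal/input"));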
> >>>
> >>> Hope it helps!...
> >