臧冬松 2011-11-11, 07:43
Denny Ye 2011-11-11, 09:50
臧冬松 2011-11-11, 10:11
Bejoy KS 2011-11-11, 11:01
臧冬松 2011-11-11, 12:46
bejoy.hadoop@... 2011-11-11, 13:25
Harsh J 2011-11-11, 13:54
Bejoy KS 2011-11-11, 14:38
Harsh J 2011-11-11, 16:06
Thanks Harsh !...
2011/11/11 Harsh J <[EMAIL PROTECTED]>
> Sorry Bejoy, I'd typed that URL out from memory.
> Fixed link is: http://wiki.apache.org/hadoop/HadoopMapReduce
> 2011/11/11 Bejoy KS <[EMAIL PROTECTED]>:
> > Thanks Harsh for correcting me with that wonderful piece of information.
> > Cleared a wrong assumption on hdfs storage fundamentals today.
> > Sorry Donal for confusing you over the same.
> > Harsh,
> > Looks like the link is broken, it'd be great if you could post the
> > url once more.
> > Thanks a lot
> > Regards
> > Bejoy.K.S
> > On Fri, Nov 11, 2011 at 7:24 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >> Bejoy,
> >> This is incorrect. As Denny had explained earlier, blocks are split by
> >> byte size alone. The writer does not concern itself with newlines, and
> >> when reading, the record readers align themselves to read till the end of
> >> lines by communicating with the next block if they have to.
> >> This is explained neatly under
> >> http://wiki.apache.org/Hadoop/MapReduceArch, para 2 of Map.
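[Editor's note: the reader behaviour described above can be sketched with a toy simulation. This is illustrative Python, not Hadoop's actual LineRecordReader; the rule it implements is the same one, though: a split owns the lines that *start* inside it, skips a partial first line, and reads past its end byte to finish its last line.]

```python
# Toy simulation of how record readers recover whole lines from blocks
# that were split at arbitrary byte offsets. Illustrative only -- this
# is not Hadoop's actual LineRecordReader code.

def read_records(data: bytes, start: int, end: int):
    """Yield the complete lines 'owned' by the byte range [start, end).

    A reader that does not start at byte 0 skips the tail of a line
    begun in the previous split (that split owns it), and it keeps
    reading past 'end' to finish its own last line.
    """
    pos = start
    if start != 0:
        # Skip the partial first line; the previous split will emit it.
        nl = data.find(b"\n", start)
        if nl == -1:
            return
        pos = nl + 1
    while pos < end:
        nl = data.find(b"\n", pos)
        if nl == -1:
            yield data[pos:]  # last line, no trailing newline
            return
        yield data[pos:nl]
        pos = nl + 1

data = b"first line\nsecond line\nthird line\n"
# Split purely by byte size into two 17-byte "blocks"; the boundary
# falls in the middle of the second line.
splits = [(0, 17), (17, len(data))]
records = [r for s, e in splits for r in read_records(data, s, e)]
assert records == [b"first line", b"second line", b"third line"]
```

Even though the byte split cuts the second line in half, the two readers together recover every line exactly once.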
> >> Regarding structured data, such as XML, one can write their own custom
> >> InputFormat that returns appropriate split points after scanning through
> >> the entire file pre-submit (say, by looking at tags).
> >> However, if you want XML, then there is already an XMLInputFormat
> >> available in Mahout. For reading N lines at a time, use NLineInputFormat.
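[Editor's note: the "scan the file pre-submit and pick split points at tag boundaries" idea can be sketched as below. Function and tag names are illustrative, not a Hadoop API; each candidate boundary is snapped forward to the end of the nearest closing tag so that no record straddles two splits.]

```python
# Hypothetical pre-submit split computation for tag-delimited data.
# Illustrative names only -- this is not part of the Hadoop API.

def tag_aligned_splits(data: bytes, target_size: int,
                       close_tag: bytes = b"</record>"):
    """Return (start, end) byte offsets of roughly target_size splits,
    each snapped forward so it ends exactly after a closing tag."""
    splits, start = [], 0
    while start < len(data):
        cut = start + target_size
        if cut >= len(data):
            splits.append((start, len(data)))
            break
        tag = data.find(close_tag, cut)
        end = len(data) if tag == -1 else tag + len(close_tag)
        splits.append((start, end))
        start = end
    return splits

doc = b"<record>x</record><record>y</record>"
splits = tag_aligned_splits(doc, target_size=5)
# Every split ends on a tag boundary, so no record is cut in half.
assert splits == [(0, 18), (18, 36)]
assert all(doc[e - len(b"</record>"):e] == b"</record>" for _, e in splits)
```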
> >> On 11-Nov-2011, at 6:55 PM, [EMAIL PROTECTED] wrote:
> >> Donal
> >> In Hadoop that hardly happens. When you are storing data in HDFS, it
> >> would be split into blocks depending on end of lines, in the case of text
> >> files. It won't be like you'd have half of a line in one block and the
> >> rest in the next one. You don't need to worry about that.
> >> The case you mentioned is like dependent data splits. Hadoop's massively
> >> parallel processing can be fully utilized only in the case of independent
> >> splits. When data splits are dependent at the file level, as I pointed
> >> out, you can go for WholeFileInputFormat.
> >> Please revert if you are still confused. Also if you have some specific
> >> scenario, please put that across so we may be able to help you
> >> better on the map reduce processing of the same.
> >> Hope it clarifies...
> >> Regards
> >> Bejoy K S
> >> ________________________________
> >> From: 臧冬松 <[EMAIL PROTECTED]>
> >> Date: Fri, 11 Nov 2011 20:46:54 +0800
> >> To: <[EMAIL PROTECTED]>
> >> ReplyTo: [EMAIL PROTECTED]
> >> Subject: Re: structured data split
> >> Thanks Bejoy!
> >> It's better to process the data blocks locally and separately.
> >> I just want to know how to deal with a structure (i.e. a word, a line)
> >> that is split across two blocks.
> >> Cheers,
> >> Donal
> >> On 2011-11-11 at 7:01 PM, Bejoy KS <[EMAIL PROTECTED]> wrote:
> >>> Hi Donal
> >>> You can configure your map tasks the way you like to process your
> >>> input. If you have a file of size 100 MB, it would be divided into two
> >>> blocks and stored in HDFS (if your dfs.block.size is the default 64 MB).
> >>> It is your choice how you process the same using map reduce.
> >>> - With the default TextInputFormat, the two blocks would be processed by
> >>> two different mappers (under default split settings). If the blocks are
> >>> in two different data nodes, then two different mappers would be
> >>> launched, one in each data node in the best case, i.e. they are
> >>> data-local map tasks.
> >>> - If you want one mapper to process the whole file, change your input
> >>> format to WholeFileInputFormat. There a map task would be triggered on
> >>> any one of the nodes where the blocks are located (best case). If both
> >>> the blocks are not on the same node, then one of the blocks would be
> >>> transferred to the map task's location for processing.
> >>> Hope it helps!...
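[Editor's note: the two options in the message above differ only in how input splits are computed. A minimal sketch, assuming a non-splittable format yields a single split covering the whole file, as a WholeFileInputFormat does by overriding isSplitable to return false (per "Hadoop: The Definitive Guide"); the splittable case gives one split per block.]

```python
# Sketch of split computation for the 100 MB file / 64 MB block
# example above. Simplified model, not Hadoop's actual code.

def make_splits(file_size: int, block_size: int, splitable: bool):
    """Return byte-range splits: one per block if splitable,
    otherwise a single split spanning the entire file."""
    if not splitable:
        return [(0, file_size)]
    return [(off, min(off + block_size, file_size))
            for off in range(0, file_size, block_size)]

MB = 1024 * 1024
# Default TextInputFormat behaviour: two splits, hence two mappers.
assert make_splits(100 * MB, 64 * MB, splitable=True) == [
    (0, 64 * MB), (64 * MB, 100 * MB)]
# WholeFileInputFormat behaviour: one split, hence one mapper, even
# though the file occupies two HDFS blocks.
assert make_splits(100 * MB, 64 * MB, splitable=False) == [(0, 100 * MB)]
```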
臧冬松 2011-11-11, 14:12
Will Maier 2011-11-11, 14:26
Charles Earl 2011-11-11, 14:42
Bejoy KS 2011-11-11, 15:10
臧冬松 2011-11-11, 15:57
臧冬松 2011-11-14, 08:32