Re: structured data split
Hi Donal
      You can configure your map tasks to process your input however you
like. A 100 MB file would be divided into two blocks and stored in HDFS
(if your dfs.block.size is the default 64 MB). How you process them with
MapReduce is your choice:
 - With the default TextInputFormat, the two blocks would be processed by
two different mappers (under default split settings). If the blocks are on
two different data nodes, then in the best case one mapper would be
spawned on each data node, i.e. they are data-local map tasks.
 - If you want one mapper to process the whole file, change your input
format to WholeFileInputFormat (a custom input format whose files are not
splittable). A single map task would then be triggered on one of the nodes
where the blocks are located (best case). If both blocks are not on the
same node, one of the blocks would be transferred to the map task location
for processing.
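
To make the arithmetic above concrete, here is a small standalone Python
sketch. It is not Hadoop code: block_ranges and input_splits are hypothetical
helper names, and the model is simplified to "one split per block" (real
split sizing has more settings than this).

```python
MB = 1024 * 1024

def block_ranges(file_size, block_size=64 * MB):
    """Return (offset, length) pairs for the blocks of a file (simplified model)."""
    ranges = []
    offset = 0
    while offset < file_size:
        length = min(block_size, file_size - offset)
        ranges.append((offset, length))
        offset += length
    return ranges

def input_splits(file_size, block_size=64 * MB, splittable=True):
    """One split per block when the format is splittable,
    otherwise a single split covering the whole file."""
    if not splittable:
        return [(0, file_size)]
    return block_ranges(file_size, block_size)

# A 100 MB file with a 64 MB block size: one 64 MB block and one 36 MB block.
blocks = block_ranges(100 * MB)
print([(off // MB, length // MB) for off, length in blocks])  # [(0, 64), (64, 36)]

# Splittable input: two splits, so two mappers.
print(len(input_splits(100 * MB)))                    # 2
# Non-splittable input: one split, so one mapper reads both blocks.
print(len(input_splits(100 * MB, splittable=False)))  # 1
```

In the non-splittable case the single mapper can be local to at most one of
the two blocks unless both happen to sit on the same node, which is why the
other block has to travel over the network.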

Hope it helps!

Thank You

2011/11/11 臧冬松 <[EMAIL PROTECTED]>

> Thanks Denny!
> So that means each map task will have to read from another DataNode
> in order to read the last line of the previous block?
> Cheers,
> Donal
> 2011/11/11 Denny Ye <[EMAIL PROTECTED]>
>> hi
>>    Structured data, such as a word or a line, can always end up split
>> across different blocks.
>>    A MapReduce task reads HDFS data in units of a *line*: it reads the
>> whole line, from the end of the previous block to the start of the
>> subsequent one, to obtain the complete line record. So you do not need to
>> worry about incomplete structured data. HDFS itself does nothing for this
>> mechanism.
>> -Regards
>> Denny Ye
>> On Fri, Nov 11, 2011 at 3:43 PM, 臧冬松 <[EMAIL PROTECTED]> wrote:
>>> Usually a large file in HDFS is split into blocks that are stored on
>>> different DataNodes.
>>> A map task is assigned to deal with each block. I wonder, what if
>>> structured data (e.g. a word) was split across two blocks?
>>> How do MapReduce and HDFS deal with this?
>>> Thanks!
>>> Donal
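
The line-boundary handling Denny describes in the thread above can be
sketched in a few lines of standalone Python. This is a simulation under
assumed conventions, not Hadoop's actual reader: read_split and split_ranges
are hypothetical names. The assumed rule is that every split except the
first skips its leading (possibly partial) line, and every split keeps
reading until it has finished the last line that starts at or before its
end offset.

```python
def split_ranges(total, size):
    """Cut [0, total) into consecutive (start, end) byte ranges of at most `size`."""
    return [(s, min(s + size, total)) for s in range(0, total, size)]

def read_split(data, start, end):
    """Return the line records owned by the split [start, end) of `data`."""
    pos = start
    if start != 0:
        # The previous split finishes whatever line is in progress here,
        # so skip ahead to just past the first newline.
        nl = data.find(b"\n", pos)
        if nl == -1:
            return []
        pos = nl + 1
    records = []
    # Read every line that begins at or before `end`, even if it finishes
    # beyond the split (i.e. in the next block).
    while pos <= end and pos < len(data):
        nl = data.find(b"\n", pos)
        stop = len(data) if nl == -1 else nl
        records.append(data[pos:stop])
        pos = stop + 1
    return records

data = b"alpha\nbravo\ncharlie\ndelta\n"
for start, end in split_ranges(len(data), 8):
    print((start, end), [r.decode() for r in read_split(data, start, end)])
# (0, 8) ['alpha', 'bravo']
# (8, 16) ['charlie']
# (16, 24) ['delta']
# (24, 26) []
```

Note that the first split reads "bravo" to completion even though the line
runs past offset 8. On a real cluster those trailing bytes may live in a
block on another DataNode, so a map task can end up reading a small amount
of remote data to finish its last line, while every line is still processed
exactly once.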