Re: map side only behavior
No. With zero reducers, no merge or sort happens in the mapper task. Each
mapper task writes its output directly, producing one output file per mapper.

2010/1/29 Gang Luo <[EMAIL PROTECTED]>

> Hi all,
> If I use only the map side to process my data (number of reducers set to 0),
> how does Hadoop behave? Will it merge and sort the spills generated by each
> mapper?
>
> -Gang
>
>
> ----- Original Message ----
> From: Gang Luo <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: 2010/1/29 (Fri) 8:54:33 AM
> Subject: Re: fine granularity operation on HDFS
>
> Yeah, I see how it works. Thanks Amogh.
>
>
> -Gang
>
>
>
> ----- Original Message ----
> From: Amogh Vasekar <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: 2010/1/28 (Thu) 10:00:22 AM
> Subject: Re: fine granularity operation on HDFS
>
> Hi Gang,
> Yes, PathFilters work only on file paths. I meant that you can put that kind
> of logic at the split level.
> The input format's getSplits() method computes the splits and adds them to a
> list; the JobTracker (JT) then initializes a mapper task for each split.
> You can override getSplits() to add only some of the splits to the list,
> chosen by, say, location or offset. Here's the code for reference:
> while (((double) bytesRemaining)/splitSize > SPLIT_SLOP) {
>   int blkIndex = getBlockIndex(blkLocations, length-bytesRemaining);
>   splits.add(new FileSplit(path, length-bytesRemaining, splitSize,
>                            blkLocations[blkIndex].getHosts()));
>   bytesRemaining -= splitSize;
> }
>
> if (bytesRemaining != 0) {
>   splits.add(new FileSplit(path, length-bytesRemaining, bytesRemaining,
>                            blkLocations[blkLocations.length-1].getHosts()));
> }
>
> You can apply your discard logic just before each splits.add call. However,
> you need to make sure your record reader handles records that are cut off at
> split boundaries.
>
> To get the block locations so that you can load blocks separately, the
> FileSystem class exposes methods such as getFileBlockLocations().
> Hope this helps.
>
> Amogh
>
> On 1/28/10 7:26 PM, "Gang Luo" <[EMAIL PROTECTED]> wrote:
>
> Thanks Amogh.
>
> For the second part of my question, I actually meant loading blocks
> separately from HDFS. I don't know whether that is realistic. Anyway, since
> my goal is to process different divisions of a file separately, doing it at
> the split level is OK. But even if I can get the splits from the InputFormat,
> how do I "add only a few splits you need to mapper and discard the others"?
> (PathFilters only work on files, not on blocks, I think.)
>
> Thanks.
> -Gang
>

--
Best Regards

Jeff Zhang
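
For readers who want to try the split-level filtering Amogh describes, here is a minimal sketch in plain Java with no Hadoop dependency. It mirrors the quoted loop but tests each candidate split against a predicate before adding it to the list. The class name SplitFilterSketch, the Split type (a stand-in for FileSplit that carries only offset and length), and the keep predicate are hypothetical names introduced for illustration. SPLIT_SLOP is set to 1.1 on the assumption that it matches the constant in FileInputFormat. Block-location lookup is left out entirely.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Predicate;

public class SplitFilterSketch {
    // Assumed to match FileInputFormat: the final split may be up to 10% oversized.
    static final double SPLIT_SLOP = 1.1;

    /** Hypothetical stand-in for FileSplit: byte offset and length only. */
    static final class Split {
        final long start;
        final long length;

        Split(long start, long length) {
            this.start = start;
            this.length = length;
        }

        @Override
        public String toString() {
            return "[" + start + ", +" + length + ")";
        }
    }

    /** Same loop shape as the quoted code, with a filter applied before each add. */
    static List<Split> computeSplits(long fileLength, long splitSize, Predicate<Split> keep) {
        List<Split> splits = new ArrayList<>();
        long bytesRemaining = fileLength;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            Split candidate = new Split(fileLength - bytesRemaining, splitSize);
            if (keep.test(candidate)) {   // discard logic goes here
                splits.add(candidate);
            }
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {
            Split tail = new Split(fileLength - bytesRemaining, bytesRemaining);
            if (keep.test(tail)) {
                splits.add(tail);
            }
        }
        return splits;
    }

    public static void main(String[] args) {
        // Keep only the splits that start within the first 200 bytes of a 350-byte file.
        List<Split> kept = computeSplits(350, 100, s -> s.start < 200);
        System.out.println(kept);
    }
}
```

In a real job the same check would sit inside an overridden getSplits() of your InputFormat subclass. Since each surviving split becomes one mapper task, the filter also decides how many output files a map-only job produces. The record reader still has to cope with records that straddle split boundaries, as noted in the thread.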