Hadoop >> mail # user >> map side only behavior


Re: map side only behavior
No, the merge and sort will not happen in the mapper task, and each mapper
task will generate one output file.
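A minimal driver sketch of such a map-only job, using the `org.apache.hadoop.mapreduce` API of that era (the `MyMapper` class, job name, and input/output paths are placeholders, not part of the Hadoop API):

```java
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

// With zero reducers, each mapper's output is written directly to the
// output directory (one part file per mapper), skipping the shuffle,
// sort, and merge phases entirely.
Configuration conf = new Configuration();
Job job = new Job(conf, "map-only example");
job.setJarByClass(MyMapper.class);   // MyMapper is a placeholder mapper class
job.setMapperClass(MyMapper.class);
job.setNumReduceTasks(0);            // map-only: no sort/merge, no reduce phase
FileInputFormat.addInputPath(job, new Path("in"));
FileOutputFormat.setOutputPath(job, new Path("out"));
job.waitForCompletion(true);
```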

2010/1/29 Gang Luo <[EMAIL PROTECTED]>

> Hi all,
> If I only use the map side to process my data (set the number of reducers
> to 0), what is the behavior of Hadoop? Will it merge and sort each of the
> spills generated by one mapper?
>
> -Gang
>
>
> ----- Original Message ----
> From: Gang Luo <[EMAIL PROTECTED]>
> To: [EMAIL PROTECTED]
> Sent: 2010/1/29 8:54:33 AM
> Subject: Re: fine granularity operation on HDFS
>
> Yeah, I see how it works. Thanks Amogh.
>
>
> -Gang
>
>
>
> ----- Original Message ----
> From: Amogh Vasekar <[EMAIL PROTECTED]>
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> Sent: 2010/1/28 10:00:22 AM
> Subject: Re: fine granularity operation on HDFS
>
> Hi Gang,
> Yes, PathFilters work only on file paths. I meant you can include that kind
> of logic at the split level.
> The InputFormat's getSplits() method is responsible for computing splits and
> adding them to a list container, for which the JobTracker initializes mapper
> tasks. You can override getSplits() to add only some of the splits to the
> list, say, based on location or offset. Here's the reference:
> while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
>   int blkIndex = getBlockIndex(blkLocations, length - bytesRemaining);
>   splits.add(new FileSplit(path, length - bytesRemaining, splitSize,
>                            blkLocations[blkIndex].getHosts()));
>   bytesRemaining -= splitSize;
> }
>
> if (bytesRemaining != 0) {
>   splits.add(new FileSplit(path, length - bytesRemaining, bytesRemaining,
>                            blkLocations[blkLocations.length - 1].getHosts()));
> }
>
> Before each splits.add() you can apply your own logic to discard unwanted
> splits. However, you need to ensure your record reader takes care of
> incomplete records at split boundaries.
>
> To get the block locations to load separately, the FileSystem class exposes
> a few methods such as getFileBlockLocations().
> Hope this helps.
>
> Amogh
>
> On 1/28/10 7:26 PM, "Gang Luo" <[EMAIL PROTECTED]> wrote:
>
> Thanks Amogh.
>
> For the second part of my question, I actually meant loading blocks
> separately from HDFS. I don't know whether that is realistic. Anyway, since
> my goal is to process different divisions of a file separately, doing that
> at the split level is OK. But even if I can get the splits from the
> InputFormat, how do I "add only a few splits you need to mapper and discard
> the others"? (PathFilters only work on files, not blocks, I think.)
>
> Thanks.
> -Gang
>
>
>
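The quoted getSplits() loop can be simulated in plain Java, without any Hadoop dependency, to see where the discard logic would slot in. The `SplitFilterDemo` class, the `keep()` filter, and the 64 MB threshold below are illustrative, not part of the Hadoop API:

```java
import java.util.ArrayList;
import java.util.List;

public class SplitFilterDemo {
    // Mirrors FileInputFormat: the last split may be up to 10% larger.
    static final double SPLIT_SLOP = 1.1;

    // Compute (offset, length) pairs the same way getSplits() does,
    // keeping only the splits that pass the filter.
    static List<long[]> computeSplits(long length, long splitSize) {
        List<long[]> splits = new ArrayList<>();
        long bytesRemaining = length;
        while (((double) bytesRemaining) / splitSize > SPLIT_SLOP) {
            long offset = length - bytesRemaining;
            if (keep(offset)) {                 // discard logic goes here
                splits.add(new long[]{offset, splitSize});
            }
            bytesRemaining -= splitSize;
        }
        if (bytesRemaining != 0) {              // tail split
            long offset = length - bytesRemaining;
            if (keep(offset)) {
                splits.add(new long[]{offset, bytesRemaining});
            }
        }
        return splits;
    }

    // Example filter: keep only splits starting at or beyond 64 MB.
    static boolean keep(long offset) {
        return offset >= 64L * 1024 * 1024;
    }

    public static void main(String[] args) {
        long blockSize = 64L * 1024 * 1024;     // 64 MB, the 0.20-era default
        for (long[] s : computeSplits(200L * 1024 * 1024, blockSize)) {
            System.out.println("offset=" + s[0] + " length=" + s[1]);
        }
    }
}
```

For a 200 MB file with a 64 MB split size, the first split (offset 0) is dropped and three splits survive, the last being the 8 MB tail.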

--
Best Regards

Jeff Zhang