why one mapper process per block? (MapReduce dev mailing list)


Re: why one mapper process per block?
Have a look at CombineFileInputFormat - it can process multiple splits per map.

Tom
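
A minimal sketch of that route, assuming a Hadoop release that ships the
concrete CombineTextInputFormat (0.20-era releases only have the abstract
CombineFileInputFormat, which you must subclass yourself); the driver class
name, paths, and the 256 MB cap are illustrative:

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombineSplitsJob {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combine-splits");
        job.setJarByClass(CombineSplitsJob.class);

        // Pack several blocks into each split, so one mapper processes
        // multiple blocks instead of exactly one.
        job.setInputFormatClass(CombineTextInputFormat.class);
        // Upper bound on the data handed to a single combined split/mapper.
        CombineTextInputFormat.setMaxInputSplitSize(job, 256L * 1024 * 1024);

        FileInputFormat.addInputPath(job, new Path(args[0]));
        FileOutputFormat.setOutputPath(job, new Path(args[1]));
        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }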

On Fri, Jan 15, 2010 at 4:11 PM, Clements, Michael
<[EMAIL PROTECTED]> wrote:
> I've been exploring the same question lately: capping max simultaneous
> tasks per node.
>
> A file split approach would work, though it may be an indirect way of
> doing it.
>
> In many cases it would be cleaner and much easier to have a max task cap
> setting; for example, this could be configurable per pool in the Fair
> Scheduler.
>
> But Hadoop currently offers no simple means (that I know of) to cap tasks
> per machine for a specific job (or pool of jobs). There is the configured
> setting (see the excerpt below), but it applies globally; if one or a few
> specific jobs need a different max, you're stuck.
>
> So the file split size approach, while indirect and more complex than a
> config setting, is the only one that I know of.
>
> The question actually has some subtlety because there is the total # of
> tasks for the job, and the # that will run simultaneously. In some
> cases, it's OK if there are a lot of tasks, so long as only 1 (or some
> other max cap) at a time runs per machine. In other cases, you need to
> limit the total # of tasks regardless of how many run simultaneously.
> The file split approach will control the total # of tasks for the job,
> which may impact (directly or indirectly) the # that run simultaneously.
>
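
(The globally applied setting in question is the per-node slot count, read by
each TaskTracker once at daemon startup from mapred-site.xml, which is why it
cannot vary per job or per pool; 0.20-era property name, value illustrative:)

    <!-- mapred-site.xml on each worker node: caps concurrent map tasks
         for every job on that node; changing it requires a TaskTracker
         restart. -->
    <property>
      <name>mapred.tasktracker.map.tasks.maximum</name>
      <value>2</value>
    </property>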
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Allen Wittenauer
> Sent: Friday, January 15, 2010 4:00 PM
> To: [EMAIL PROTECTED]
> Subject: Re: why one mapper process per block?
>
>
> On 1/15/10 3:55 PM, "Erez Katz" <[EMAIL PROTECTED]> wrote:
>> What would it take to pipe ALL the blocks that are part of the input set,
>> on a given node, to ONE mapper process?
>
> Probably just setting mapred.min.split.size to a high enough value.
>
>
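
In old-API terms, Allen's suggestion comes down to a one-line job setting; a
rough sketch (the 1 GB figure and the driver class name are illustrative):

    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapred.FileInputFormat;
    import org.apache.hadoop.mapred.FileOutputFormat;
    import org.apache.hadoop.mapred.JobClient;
    import org.apache.hadoop.mapred.JobConf;

    public class BigSplitsJob {
      public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(BigSplitsJob.class);
        conf.setJobName("big-splits");

        // With 64-128 MB blocks, a ~1 GB minimum split size packs many
        // blocks into each split, so far fewer mappers run in total.
        conf.setLong("mapred.min.split.size", 1024L * 1024 * 1024);

        FileInputFormat.setInputPaths(conf, new Path(args[0]));
        FileOutputFormat.setOutputPath(conf, new Path(args[1]));
        JobClient.runJob(conf);
      }
    }

Two caveats: splits never cross file boundaries, so many small files still
yield many mappers (the case CombineFileInputFormat addresses), and splits
much larger than one block give up data locality, since most of a split's
blocks will live on other nodes.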