Have a look at CombineFileInputFormat - it can process multiple splits per map.
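If it helps, here is a rough sketch of a driver using the newer mapreduce
API (CombineTextInputFormat is a concrete subclass; class and property
names have shifted between Hadoop versions, so treat the exact calls as
illustrative):

    import org.apache.hadoop.conf.Configuration;
    import org.apache.hadoop.fs.Path;
    import org.apache.hadoop.mapreduce.Job;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.CombineTextInputFormat;
    import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;

    public class CombinedInputDriver {
      public static void main(String[] args) throws Exception {
        Job job = Job.getInstance(new Configuration(), "combined-input");
        job.setJarByClass(CombinedInputDriver.class);

        // One map task per *combined* split instead of one per block.
        job.setInputFormatClass(CombineTextInputFormat.class);

        // Cap how much data gets packed into a single split (1 GB here);
        // a larger cap means fewer, bigger map tasks.
        CombineTextInputFormat.setMaxInputSplitSize(job, 1024L * 1024 * 1024);
        CombineTextInputFormat.addInputPath(job, new Path(args[0]));

        job.setMapperClass(Mapper.class);  // identity mapper, just for illustration
        job.setNumReduceTasks(0);
        FileOutputFormat.setOutputPath(job, new Path(args[1]));

        System.exit(job.waitForCompletion(true) ? 0 : 1);
      }
    }

Locality can suffer a little, since a combined split may span blocks that
are not all local to the node, but it does bring the task count way down.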
On Fri, Jan 15, 2010 at 4:11 PM, Clements, Michael
<[EMAIL PROTECTED]> wrote:
> I've been exploring the same question lately: capping max simultaneous
> tasks per node.
> A file split approach would work, though it may be an indirect way of
> doing it.
> In many cases it would be cleaner and much easier to have an explicit max
> task cap; for example, this could/should be configurable per pool in the
> Fair Scheduler.
> But Hadoop currently has no simple means (that I know of) to cap tasks
> per machine for a specific job (or pool of jobs). You have the configured
> setting, which is applied globally. If one or a few specific jobs need a
> different max, you're out of luck.
> So the file split size approach, while indirect and more complex than a
> config setting, is the only one that I know of.
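> The global setting I'm referring to is the per-node slot count set in
> mapred-site.xml on each TaskTracker, roughly like this (property names as
> of the 0.20 line; later versions rename them, so treat this as a sketch):
>
>   <!-- caps concurrent tasks per node for ALL jobs, not per job or pool -->
>   <property>
>     <name>mapred.tasktracker.map.tasks.maximum</name>
>     <value>4</value>
>   </property>
>   <property>
>     <name>mapred.tasktracker.reduce.tasks.maximum</name>
>     <value>2</value>
>   </property>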
> The question actually has some subtlety because there is the total # of
> tasks for the job, and the # that will run simultaneously. In some
> cases, it's OK if there are a lot of tasks, so long as only 1 (or some
> other max cap) at a time runs per machine. In other cases, you need to
> limit the total # of tasks regardless of how many run simultaneously.
> The file split approach will control the total # of tasks for the job,
> which may impact (directly or indirectly) the # that run simultaneously.
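> To put numbers on it: a 100 GB input at the default 64 MB block size
> yields roughly 1600 map tasks in total; raising the min split size to
> 1 GB cuts that to roughly 100. That shrinks the total task count, but
> nothing in that arithmetic directly limits how many run per machine at
> once.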
> -----Original Message-----
> From: [EMAIL PROTECTED] [mailto:[EMAIL PROTECTED]] On Behalf Of Allen Wittenauer
> Sent: Friday, January 15, 2010 4:00 PM
> To: [EMAIL PROTECTED]
> Subject: Re: why one mapper process per block?
> On 1/15/10 3:55 PM, "Erez Katz" <[EMAIL PROTECTED]> wrote:
>> What would it take to pipe ALL the blocks that are part of the input
>> set, on a given node, to ONE mapper process?
> Probably just setting mapred.min.split.size to a high enough value.
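> Something like this on the job's Configuration, for example (old-API
> property name; newer releases call it
> mapreduce.input.fileinputformat.split.minsize):
>
>   // ask for splits of at least ~10 GB so several co-located blocks
>   // land in one split, and therefore one mapper
>   conf.setLong("mapred.min.split.size", 10L * 1024 * 1024 * 1024);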