-
why one mapper process per block?
Erez Katz 2010-01-15, 23:55
I can see why it is straightforward to have one mapper process per block: it is a simple "cat block | mapper". On the other hand, when a mapper's start-up time is not trivial (say I need to load a fairly large dictionary), that scheme is less than ideal, because the start-up cost is paid once for every block that happens to be on that node.
What would it take to pipe ALL the blocks that are part of the input set, on a given node, to ONE mapper process?
Cheers,
Erez Katz
-
Re: why one mapper process per block?
Allen Wittenauer 2010-01-16, 00:00
On 1/15/10 3:55 PM, "Erez Katz" <[EMAIL PROTECTED]> wrote:
> What would it take to pipe ALL the blocks that are part of the input set, on
> a given node, to ONE mapper process?
Probably just setting mapred.min.split.size to a high enough value.
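To see why raising mapred.min.split.size reduces the number of mappers, here is a rough Python sketch of the split-size rule that FileInputFormat applies per file: split size = max(min split size, min(goal size, block size)). The function names and the default goal size (the whole file, as if a single map task had been requested) are assumptions made for illustration. The real implementation has extra details, such as a small slop factor on the last split, that are left out here.

```python
def split_size(block_size, min_split_size, goal_size):
    # Simplified model of the rule: the minimum split size wins
    # whenever it is larger than the block size.
    return max(min_split_size, min(goal_size, block_size))


def count_splits(file_size, block_size, min_split_size, goal_size=None):
    # Assumption for this sketch: with no explicit goal, use the whole
    # file as the goal size (as if one map task had been requested).
    if goal_size is None:
        goal_size = file_size
    size = split_size(block_size, min_split_size, goal_size)
    # Ceiling division: a trailing partial split still needs a mapper.
    return -(-file_size // size)


MiB = 1024 ** 2
GiB = 1024 ** 3

if __name__ == "__main__":
    # 10 GiB file, 64 MiB blocks, default-like minimum: one split per block.
    print(count_splits(10 * GiB, 64 * MiB, 1))
    # Same file with a 1 GiB minimum split size: far fewer mappers.
    print(count_splits(10 * GiB, 64 * MiB, 1 * GiB))
```

Two caveats with this approach: splits are computed per file, so a large minimum only merges blocks within one file, and a split that spans many blocks will usually read some of them over the network, since those blocks need not live on the same node.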
-
RE: why one mapper process per block?
Clements, Michael 2010-01-16, 00:11
I've been exploring the same question lately: capping max simultaneous tasks per node.
A file split approach would work, though it may be an indirect way of doing it.
In many cases it would be cleaner and much easier to have a max task cap setting. For example, this could/should be configurable as a Fair Scheduler pool setting.
But Hadoop currently has no simple means (that I know of) to cap the number of tasks per machine for a specific job (or pool of jobs). You have the configured setting, which is applied globally. If one or a few specific jobs need a different max, you're stuck.
So the file split size approach, while indirect and more complex than a config setting, is the only one that I know of.
The question actually has some subtlety because there is the total # of tasks for the job, and the # that will run simultaneously. In some cases, it's OK if there are a lot of tasks, so long as only 1 (or some other max cap) at a time runs per machine. In other cases, you need to limit the total # of tasks regardless of how many run simultaneously. The file split approach will control the total # of tasks for the job, which may impact (directly or indirectly) the # that run simultaneously.
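The distinction between the total # of tasks and the # running at once can be made concrete with a little arithmetic. The Python sketch below is only an illustration: the function name, the idea of a fixed number of map slots per node, and the assumption that tasks spread evenly over the cluster are all simplifications introduced here, not Hadoop behavior taken from this thread.

```python
def task_summary(total_tasks, nodes, slots_per_node, startup_seconds):
    # Total map slots in the cluster under the (global) per-node setting.
    cluster_slots = nodes * slots_per_node
    # At most this many tasks can run at the same time.
    concurrent = min(total_tasks, cluster_slots)
    # Number of scheduling "waves" needed, assuming an even spread.
    waves = -(-total_tasks // cluster_slots)
    # Start-up cost is paid once per task, however many run at once.
    startup_total = total_tasks * startup_seconds
    return {
        "concurrent": concurrent,
        "waves": waves,
        "startup_total": startup_total,
    }


if __name__ == "__main__":
    # 160 block-sized tasks, 4 nodes, 2 slots each, 30 s start-up per task.
    print(task_summary(160, 4, 2, 30))
    # Larger splits: 4 tasks in total on the same cluster.
    print(task_summary(4, 4, 2, 30))
```

In this model, a larger split size cuts the total start-up cost directly, while the number running simultaneously only drops once the total falls below the number of available slots.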
-
Re: why one mapper process per block?
Tom White 2010-01-16, 00:35
Have a look at CombineFileInputFormat - it can process multiple splits per map.
Tom
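As a rough illustration of the idea behind combining inputs, the Python toy below packs blocks that live on the same node into one split, up to a size limit, so that one mapper start-up serves several blocks. This is not the CombineFileInputFormat algorithm or API: the function name, the (block_id, node, size) tuples, and the single-host-per-block model are assumptions made for the sketch. Real blocks have several replicas, and the real class is Java.

```python
from collections import defaultdict


def combine_by_node(blocks, max_split_size):
    # blocks: iterable of (block_id, node, size) tuples (hypothetical model,
    # one host per block; real HDFS blocks have multiple replicas).
    by_node = defaultdict(list)
    for block_id, node, size in blocks:
        by_node[node].append((block_id, size))

    splits = []
    for node in sorted(by_node):
        current, current_size = [], 0
        for block_id, size in by_node[node]:
            # Close the current split if adding this block would exceed the cap.
            if current and current_size + size > max_split_size:
                splits.append((node, current))
                current, current_size = [], 0
            current.append(block_id)
            current_size += size
        if current:
            splits.append((node, current))
    return splits


if __name__ == "__main__":
    blocks = [("b1", "n1", 64), ("b2", "n2", 64), ("b3", "n1", 64), ("b4", "n1", 64)]
    # With a generous cap, each node's blocks end up in a single split.
    for node, ids in combine_by_node(blocks, 1000):
        print(node, ids)
```

With a large enough cap, each node gets one split, which is the "all local blocks to one mapper" behavior the original question asks about; a smaller cap trades that off against parallelism.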