Hadoop, mail # user - Re: How can I limit reducers to one-per-node?


Re: How can I limit reducers to one-per-node?
Ted Dunning 2013-02-11, 05:55
For crawler-type apps, you typically direct all of the URLs to crawl from
a single domain to a single reducer. You also typically run many
reducers so that you can get decent aggregate bandwidth.
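
That per-domain routing boils down to a partitioning function. The following is a minimal standalone sketch, not code from this thread; in a real Hadoop job the same logic would live in a custom Partitioner's getPartition method, and the class and method names here are illustrative assumptions.

```java
import java.net.URI;

public class DomainPartitioner {
    // Map a URL to a partition in [0, numReduceTasks) based only on its
    // host, so every URL from one domain reaches the same reducer while
    // distinct domains spread across many reducers.
    public static int partitionFor(String url, int numReduceTasks) {
        String host = URI.create(url).getHost();
        // Mask off the sign bit so the modulo result is non-negative.
        return (host.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }
}
```

Because the partition depends only on the host, two URLs from the same site always land on the same reducer, which is what makes per-host politeness enforceable there.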

It is also common to take the normal web politeness standards with a
grain of salt, treating them as an average rate rather than a strict
per-request delay: do several requests over a single connection, then
wait a bit longer than you otherwise would. This is easier on the
target domain and improves your crawler's utilization.

Large-scale crawlers typically work out of a large data store with a
flags column that is pinned into memory. Successive passes of the
crawler can scan the flags column very quickly to find domains with
work to be done. This work can be done using map-reduce, but it is only
vaguely like a map-reduce job.
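
One way to picture that pinned flags column: one status byte per domain held in RAM, scanned on each pass. This is a hypothetical sketch; the flag values, names, and in-memory layout are assumptions for illustration, not details from the thread.

```java
import java.util.ArrayList;
import java.util.List;

public class CrawlFlags {
    public static final byte DONE = 0;
    public static final byte PENDING = 1;

    // Scan the in-memory flags column and collect the indices of
    // domains that still have URLs to fetch; each crawl pass can do
    // this scan cheaply without touching the backing store.
    public static List<Integer> domainsWithWork(byte[] flags) {
        List<Integer> pending = new ArrayList<>();
        for (int i = 0; i < flags.length; i++) {
            if (flags[i] == PENDING) pending.add(i);
        }
        return pending;
    }
}
```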

On Sun, Feb 10, 2013 at 10:48 PM, Harsh J <[EMAIL PROTECTED]> wrote:

> The suggestion to add a combiner is to help reduce the shuffle load
> (and perhaps, reduce # of reducers needed?), but it doesn't affect
> scheduling of a set number of reduce tasks nor does a scheduler care
> currently if you add that step in or not.
>
> On Mon, Feb 11, 2013 at 7:59 AM, David Parks <[EMAIL PROTECTED]>
> wrote:
> > I guess the FairScheduler is doing multiple assignments per heartbeat,
> > hence the behavior of multiple reduce tasks per node even when they
> > should otherwise be fully distributed.
> >
> > Adding a combiner will change this behavior? Could you explain more?
> >
> > Thanks!
> >
> > David
> >
> > From: Michael Segel [mailto:[EMAIL PROTECTED]]
> > Sent: Monday, February 11, 2013 8:30 AM
> >
> >
> > To: [EMAIL PROTECTED]
> > Subject: Re: How can I limit reducers to one-per-node?
> >
> > Adding a combiner step first then reduce?
> >
> > On Feb 8, 2013, at 11:18 PM, Harsh J <[EMAIL PROTECTED]> wrote:
> >
> > Hey David,
> >
> > There's no readily available way to do this today (you may be
> > interested in MAPREDUCE-199 though) but if your Job scheduler's not
> > doing multiple-assignments on reduce tasks, then only one is assigned
> > per TT heartbeat, which gives you almost what you're looking for: 1
> > reduce task per node, round-robin'd (roughly).
> >
> > On Sat, Feb 9, 2013 at 9:24 AM, David Parks <[EMAIL PROTECTED]> wrote:
> >
> > I have a cluster of boxes with 3 reducers per node. I want to limit a
> > particular job to only run 1 reducer per node.
> >
> > This job is network IO bound, gathering images from a set of webservers.
> >
> > My job has certain parameters set to meet “web politeness” standards
> > (e.g. limit connections and connection frequency).
> >
> > If this job runs from multiple reducers on the same node, those per-host
> > limits will be violated. Also, this is a shared environment and I don’t
> > want long-running, network-bound jobs uselessly taking up all reduce
> > slots.
> >
> > --
> > Harsh J
> >
> > Michael Segel  | (m) 312.755.9623
> >
> > Segel and Associates
> >
>
> --
> Harsh J
>