Hadoop, mail # user - does hadoop always respect setNumReduceTasks?


Re: does hadoop always respect setNumReduceTasks?
Bejoy Ks 2012-03-09, 19:08
Hi Jane
      Adding to Lance's comments (answers to your other queries):

> i am wondering if hadoop always respect Job.setNumReduceTasks(int)?
             Yes, unless the property is marked final in mapred-site.xml
(which you normally never do).

>i noticed that most, if not all, my jobs have
only 1 reducer, regardless of whether or not i set
Job.setNumReduceTasks(int).
         The default value in mapred-site.xml is 1, but you should be able to
override it per job; doing so is a common requirement in MapReduce jobs.

> if i set the number of reduce tasks to 1, is this always guaranteed? i ask
> because i don't know if there is a gotcha like the combiner (where it may or
> may not run at all).
      Yes, it is guaranteed. A combiner comes into play only if you specify
one; by default no combiner is used.

Regards
Bejoy KS
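
To see why one reduce task receives every key: Hadoop's default HashPartitioner
computes (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks, so with a single
reduce task every key maps to partition 0. A minimal plain-Java sketch of that
formula (the class name is mine; no Hadoop dependency):

```java
public class PartitionSketch {

    // Mirrors the formula used by Hadoop's default HashPartitioner:
    // mask to a non-negative hash, then mod by the number of reduce tasks.
    public static int partitionFor(String key, int numReduceTasks) {
        return (key.hashCode() & Integer.MAX_VALUE) % numReduceTasks;
    }

    public static void main(String[] args) {
        String[] words = {"apple", "banana", "cherry"};
        for (String w : words) {
            // With numReduceTasks = 1 every key lands in partition 0,
            // so a single reducer sees the entire key space.
            System.out.println(w + " -> " + partitionFor(w, 1));
        }
    }
}
```

With more than one reduce task, the same formula spreads keys across partitions,
which is exactly why a global counter in one reducer stops working.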

On Fri, Mar 9, 2012 at 8:08 AM, Lance Norskog <[EMAIL PROTECTED]> wrote:

> Instead of String.hashCode() you can use an MD5 hash. No accidental MD5
> collision has been observed "in the wild". (MD5 has been broken
> cryptographically, but that's not relevant here.)
>
> http://snippets.dzone.com/posts/show/3686
>
> I think the Partitioner class guarantees that you will have multiple
> reducers.
>
> On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne <[EMAIL PROTECTED]>
> wrote:
> > i am wondering if hadoop always respect Job.setNumReduceTasks(int)?
> >
> > as i am emitting items from the mapper, i expect/desire only 1 reducer to
> > get these items because i want to assign each key of the key-value input
> > pair a unique integer id. if i had 1 reducer, i can just keep a local
> > counter (with respect to the reducer instance) and increment it.
> >
> > on my local hadoop cluster, i noticed that most, if not all, my jobs have
> > only 1 reducer, regardless of whether or not i set
> > Job.setNumReduceTasks(int).
> >
> > however, as soon as i moved the code unto amazon's elastic mapreduce
> (emr),
> > i notice that there are multiple reducers. if i set the number of reduce
> > tasks to 1, is this always guaranteed? i ask because i don't know if
> there
> > is a gotcha like the combiner (where it may or may not run at all).
> >
> > also, it looks like this might not be a good idea just having 1 reducer
> (it
> > won't scale). it is most likely better if there are +1 reducers, but in
> > that case, i lose the ability to assign unique numbers to the key-value
> > pairs coming in. is there a design pattern out there that addresses this
> > issue?
> >
> > my mapper/reducer key-value pair signatures looks something like the
> > following.
> >
> > mapper(Text, Text, Text, IntWritable)
> > reducer(Text, IntWritable, IntWritable, Text)
> >
> > the mapper reads a sequence file whose key-value pairs are of type Text
> and
> > Text. i then emit Text (let's say a word) and IntWritable (let's say
> > frequency of the word).
> >
> > the reducer gets the word and its frequencies, and then assigns the word
> an
> > integer id. it emits IntWritable (the id) and Text (the word).
> >
> > i remember seeing code from mahout's API where they assign integer ids to
> > items. the items were already given an id of type long. the conversion
> they
> > make is as follows.
> >
> > public static int idToIndex(long id) {
> >  return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
> > }
> >
> > is there something equivalent for Text or a "word"? i was thinking about
> > simply taking the hash value of the string/word, but of course, different
> > strings can map to the same hash value.
>
>
>
> --
> Lance Norskog
> [EMAIL PROTECTED]
>
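
For the thread's last question, Lance's MD5 suggestion can be combined with the
Mahout-style idToIndex fold quoted above into one plain-Java sketch (the class
and method names here are mine, not Mahout's or Hadoop's): hash the word with
MD5, take the first 8 bytes as a long, then fold that into a non-negative int.
Collisions remain possible in a 31-bit space, but they are far less likely to
correlate with input structure than String.hashCode() collisions.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

public class WordIdSketch {

    // Fold a 64-bit id into a non-negative int, as in the Mahout
    // snippet quoted in the thread: XOR the halves, mask the sign bit.
    public static int idToIndex(long id) {
        return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
    }

    // Hash a word with MD5 and pack the first 8 digest bytes into a long.
    public static long md5Long(String word) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(word.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {
                h = (h << 8) | (digest[i] & 0xFF);
            }
            return h;
        } catch (NoSuchAlgorithmException e) {
            // MD5 is guaranteed to be present in every JRE.
            throw new IllegalStateException(e);
        }
    }

    // Deterministic non-negative int id for a word.
    public static int wordToId(String word) {
        return idToIndex(md5Long(word));
    }

    public static void main(String[] args) {
        System.out.println("hadoop -> " + wordToId("hadoop"));
    }
}
```

Because the id is derived purely from the word, every reducer computes the same
id for the same word independently, sidestepping the single-reducer counter.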