Hadoop >> mail # user >> does hadoop always respect setNumReduceTasks?


Jane Wayne 2012-03-09, 02:30
Lance Norskog 2012-03-09, 02:38
Bejoy Ks 2012-03-09, 19:08
Re: does hadoop always respect setNumReduceTasks?
Thanks Lance.

On Thu, Mar 8, 2012 at 9:38 PM, Lance Norskog <[EMAIL PROTECTED]> wrote:

> Instead of String.hashCode() you can use the MD5 hash. It has never
> been observed to produce a duplicate "in the wild". (It has been
> broken cryptographically, but that's not relevant here.)
>
> http://snippets.dzone.com/posts/show/3686
>
> I think the Partitioner class is what controls how keys are
> distributed across the reducers.
>
> On Thu, Mar 8, 2012 at 6:30 PM, Jane Wayne <[EMAIL PROTECTED]> wrote:
> > I am wondering: does Hadoop always respect Job.setNumReduceTasks(int)?
> >
> > As I emit items from the mapper, I expect/desire only one reducer to
> > get these items, because I want to assign each key of the key-value
> > input pair a unique integer id. With one reducer, I can just keep a
> > local counter (with respect to the reducer instance) and increment it.
> >
> > On my local Hadoop cluster, I noticed that most, if not all, of my
> > jobs have only one reducer, regardless of whether or not I set
> > Job.setNumReduceTasks(int).
> >
> > However, as soon as I moved the code onto Amazon's Elastic MapReduce
> > (EMR), I noticed that there are multiple reducers. If I set the number
> > of reduce tasks to 1, is this always guaranteed? I ask because I don't
> > know if there is a gotcha, as with the combiner (which may or may not
> > run at all).
> >
> > Also, it looks like having just one reducer might not be a good idea
> > (it won't scale). It is most likely better to have more than one
> > reducer, but in that case I lose the ability to assign unique numbers
> > to the incoming key-value pairs. Is there a design pattern out there
> > that addresses this issue?
> >
> > My mapper/reducer key-value pair signatures look something like the
> > following:
> >
> > mapper(Text, Text, Text, IntWritable)
> > reducer(Text, IntWritable, IntWritable, Text)
> >
> > The mapper reads a sequence file whose key-value pairs are of type
> > Text and Text. I then emit Text (say, a word) and IntWritable (say,
> > the frequency of the word).
> >
> > The reducer gets the word and its frequencies, and then assigns the
> > word an integer id. It emits IntWritable (the id) and Text (the word).
> >
> > I remember seeing code in Mahout's API where they assign integer ids
> > to items. The items were already given an id of type long. The
> > conversion they make is as follows:
> >
> > public static int idToIndex(long id) {
> >  return 0x7FFFFFFF & ((int) id ^ (int) (id >>> 32));
> > }
> >
> > Is there something equivalent for Text or a "word"? I was thinking of
> > simply taking the hash value of the string/word, but of course,
> > different strings can map to the same hash value.
>
>
>
> --
> Lance Norskog
> [EMAIL PROTECTED]
>
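
Editor's note: a minimal sketch of the single-reducer counter approach
Jane describes, assuming the org.apache.hadoop.mapreduce API; the class
and variable names are illustrative, not from the thread. An explicit
Job.setNumReduceTasks(n) is honored by the framework, so the job gets
exactly n reduce tasks (0 makes it map-only); unlike the combiner, there
is no may-or-may-not-run gotcha. One historical exception is the
single-JVM LocalJobRunner, which capped local jobs at one reduce task,
which would explain the behavior Jane saw on her local setup.

    import java.io.IOException;
    import org.apache.hadoop.io.IntWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Matches reducer(Text, IntWritable, IntWritable, Text) from the
    // question: in (word, counts), out (id, word). The local counter is
    // safe only because exactly one reducer instance sees every key,
    // i.e. the driver called job.setNumReduceTasks(1).
    public class WordIdReducer
        extends Reducer<Text, IntWritable, IntWritable, Text> {
      private int nextId = 0;

      @Override
      protected void reduce(Text word, Iterable<IntWritable> counts,
                            Context context)
          throws IOException, InterruptedException {
        context.write(new IntWritable(nextId++), new Text(word));
      }
    }

If one reducer becomes a bottleneck, a common pattern is to keep ids
unique across R reducers by starting each reducer's counter at its
partition number and incrementing by R, so reducer i emits i, i + R,
i + 2R, and so on.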
WangRamon 2012-03-10, 11:37
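
Editor's note: on Lance's MD5 suggestion and Jane's collision concern, a
minimal sketch (the class and method names are illustrative) that hashes
a word with MD5 and folds the first four digest bytes into a
non-negative int, mirroring the sign-bit mask in Mahout's idToIndex
above. Caveat: truncating the 128-bit digest to 31 bits reintroduces the
same collision risk Jane worries about with String.hashCode(); only the
full digest is collision-free in practice, so guaranteed-unique ids
still require something like the counter approach sketched earlier.

    import java.nio.charset.StandardCharsets;
    import java.security.MessageDigest;
    import java.security.NoSuchAlgorithmException;

    public class WordIds {
      // MD5-hash the word, fold the leading four digest bytes into an
      // int, then clear the sign bit exactly as Mahout's idToIndex does.
      public static int wordToIndex(String word) {
        try {
          MessageDigest md5 = MessageDigest.getInstance("MD5");
          byte[] d = md5.digest(word.getBytes(StandardCharsets.UTF_8));
          int h = ((d[0] & 0xFF) << 24) | ((d[1] & 0xFF) << 16)
                | ((d[2] & 0xFF) << 8)  | (d[3] & 0xFF);
          return 0x7FFFFFFF & h;
        } catch (NoSuchAlgorithmException e) {
          // Every JRE ships MD5, so this should be unreachable.
          throw new IllegalStateException(e);
        }
      }
    }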