Re: Hadoop - Distributed sorting
samir das mohapatra 2012-05-15, 18:35
Steps to do this:
1) Map: defines the key/value pair for each number.
2) Combiner: sorts locally over each chunk of the dataset.
3) Reducer: then sorts globally over the whole dataset, so the output is
sorted.
Note: set the combiner and the reducer to the same class.
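The steps above can be sketched in plain Java without any Hadoop classes. This is only a single-process simulation under stated assumptions: the class name LocalThenGlobalSort, its method names, and the sample chunks are all hypothetical, and the global step is done here with a k-way merge over a PriorityQueue, which stands in for the work a real job would do rather than reproducing it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.PriorityQueue;

// Hypothetical single-process simulation; no Hadoop classes are used.
class LocalThenGlobalSort {

    // Combiner-like step: sort one chunk locally, leaving the input untouched.
    static List<Integer> sortChunk(List<Integer> chunk) {
        List<Integer> copy = new ArrayList<>(chunk);
        Collections.sort(copy);
        return copy;
    }

    // Reducer-like step: k-way merge of chunks that are already sorted.
    static List<Integer> mergeSorted(List<List<Integer>> sortedChunks) {
        // Each queue entry is {value, chunkIndex, positionInChunk}, ordered by value.
        PriorityQueue<int[]> heads =
                new PriorityQueue<>((a, b) -> Integer.compare(a[0], b[0]));
        for (int c = 0; c < sortedChunks.size(); c++) {
            if (!sortedChunks.get(c).isEmpty()) {
                heads.add(new int[] {sortedChunks.get(c).get(0), c, 0});
            }
        }
        List<Integer> merged = new ArrayList<>();
        while (!heads.isEmpty()) {
            int[] head = heads.poll();          // smallest remaining value
            merged.add(head[0]);
            List<Integer> chunk = sortedChunks.get(head[1]);
            int next = head[2] + 1;
            if (next < chunk.size()) {          // refill from the same chunk
                heads.add(new int[] {chunk.get(next), head[1], next});
            }
        }
        return merged;
    }

    // Runs the local step on every chunk, then the global merge.
    static List<Integer> sortAll(List<List<Integer>> chunks) {
        List<List<Integer>> sorted = new ArrayList<>();
        for (List<Integer> chunk : chunks) {
            sorted.add(sortChunk(chunk));
        }
        return mergeSorted(sorted);
    }

    public static void main(String[] args) {
        List<List<Integer>> chunks = Arrays.asList(
                Arrays.asList(150, 101, 199),
                Arrays.asList(120, 100),
                Arrays.asList(175, 150));
        System.out.println(sortAll(chunks)); // [100, 101, 120, 150, 150, 175, 199]
    }
}
```

Sorting each chunk first keeps the global step a cheap merge, which is the same split of work that merge sort relies on.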
Let us assume that our data set (integers) is constrained to the range 100
to 200, and that we have 5 files, each containing 1000 random integers in
that range (so a total of 5000 integers). We read each file into a Map, and
then in the Reduce phase we produce a final Map that contains the count of
each integer. If we now sort the integers in the final Map and output them
into a list of <Integer, Count> pairs, then we have sorted all the data
(see figure below). Aside: in Java, you don't even have to come up with the
data structure that I am talking about. If you
just use a TreeMap <http://java.sun.com/javase/6/docs/api/index.html?java/util/TreeMap.html> in
the final Reduce phase, then all the keys (i.e. data) are already
as long as the key type (e.g. String, Integer, etc.) implements the
Hadoop <http://hadoop.apache.org/> has something similar called
I am using a TreeMap that takes Strings as keys in
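To make the counting idea concrete, here is a small sketch, assuming the integers are already in memory rather than read from files. The class name CountingSortSketch and its methods are hypothetical; TreeMap and Map.merge are standard java.util APIs. The per-file counts play the role of the Map output, and summing them into one TreeMap plays the role of the final Reduce.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical illustration of sorting by counting with a TreeMap.
class CountingSortSketch {

    // Map-like step: count how often each integer occurs in one file's data.
    static Map<Integer, Integer> countFile(List<Integer> numbers) {
        Map<Integer, Integer> counts = new TreeMap<>();
        for (int n : numbers) {
            counts.merge(n, 1, Integer::sum);
        }
        return counts;
    }

    // Reduce-like step: sum the per-file counts into a single TreeMap.
    // A TreeMap iterates its keys in ascending order, so no extra sort is needed.
    static TreeMap<Integer, Integer> mergeCounts(List<Map<Integer, Integer>> perFile) {
        TreeMap<Integer, Integer> total = new TreeMap<>();
        for (Map<Integer, Integer> counts : perFile) {
            for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
                total.merge(e.getKey(), e.getValue(), Integer::sum);
            }
        }
        return total;
    }

    // Expands the <Integer, Count> pairs back into the full sorted list.
    static List<Integer> expand(TreeMap<Integer, Integer> counts) {
        List<Integer> sorted = new ArrayList<>();
        for (Map.Entry<Integer, Integer> e : counts.entrySet()) {
            for (int i = 0; i < e.getValue(); i++) {
                sorted.add(e.getKey());
            }
        }
        return sorted;
    }

    public static void main(String[] args) {
        List<Map<Integer, Integer>> perFile = new ArrayList<>();
        perFile.add(countFile(Arrays.asList(150, 100, 150)));
        perFile.add(countFile(Arrays.asList(200, 100)));
        TreeMap<Integer, Integer> total = mergeCounts(perFile);
        System.out.println(total);         // {100=2, 150=2, 200=1}
        System.out.println(expand(total)); // [100, 100, 150, 150, 200]
    }
}
```

This only pays off when the range of distinct values is small compared with the number of items, as in the 100 to 200 example.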
On Tue, May 15, 2012 at 11:31 PM, @dataElGrande <[EMAIL PROTECTED]> wrote:
> Check out Pentaho's how-tos when dealing with Hadoop, NoSQL, or anything
> data related. http://wiki.pentaho.com/display/BAD/How+To%27s
> madhu_sushmi wrote:
> > Hi,
> > I need to implement distributed sorting using Hadoop. I am quite new to
> > Hadoop and I am getting confused. If I want to implement merge sort, what
> > should my Map and Reduce be doing? Should all the sorting happen on the
> > reduce side?
> > Please help. This is an urgent requirement. Please guide me.
> > Thanks,
> > Madhu
> Sent from the Hadoop core-dev mailing list archive at Nabble.com.