Re: distributing a time consuming single reduce task
In general, keeping the values you iterate through in memory in inputSet
is a bad idea. How many items do you have, and how large is inputSet when
you finish? You should make inputSet a local variable in the reduce method,
since you are not using its contents later. Also, with the published code
that set will grow forever, since you never clear it after the reduce
method, and that will surely run you out of memory.
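
Something along these lines should keep memory bounded per reduce call.
This is just an untested sketch that reuses your own names
(HierarchicalClusterer, CompleteLinkClusterer, MAX_DISTANCE and
EDIT_DISTANCE all come from your code), with the collections moved inside
reduce():

public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {

    private final Text key = new Text("1");
    private final Text group = new Text();

    // the clusterer carries no per-key state, so it can stay a field
    private final HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        // local to this call, so it starts empty for every key
        Set<String> inputSet = new HashSet<String>();
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }

        // perform clustering on this key's values only
        Set<Set<String>> clClustering = clClusterer.cluster(inputSet);

        StringBuilder clusterBuilder = new StringBuilder();
        for (Set<String> brandsSet : clClustering) {
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand).append(",");
            }
            clusterBuilder.append("]");
        }
        group.set(clusterBuilder.toString());
        context.write(key, group);
    }
}

Because inputSet and clusterBuilder are now locals, everything becomes
garbage as soon as the call returns, and nothing accumulates across keys.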

On Mon, Jan 23, 2012 at 12:29 PM, Ahmed Abdeen Hamed <
[EMAIL PROTECTED]> wrote:

> Hello friends,
>
> I wrote a reduce() that receives a large dataset as text values from the
> map(). The purpose of the reduce() is to compute the distance between each
> item in the values text. When I run it, I run out of memory. I tried
> increasing the heap size, but that didn't scale either. I am wondering if
> there is a way to distribute the reduce() to get it to scale. If this is
> possible, can you kindly share your idea?
> Please note that it is crucial for the values to be passed together in
> this fashion, so they can be clustered into groups.
>
> Here is what the reduce() looks like:
>
>
>
> public static class BrandClusteringReducer extends Reducer<Text, Text,
> Text, Text> {
>
>     Text key = new Text("1");
>     Set<String> inputSet = new HashSet<String>();
>     StringBuilder clusterBuilder = new StringBuilder();
>     Set<Set<String>> clClustering = null;
>     Text group = new Text();
>
>     // Complete-Link Clusterer
>     HierarchicalClusterer<String> clClusterer =
>             new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
>     String[] brandsList = null;
>
>     public void reduce(Text productID, Iterable<Text> brandNames, Context context)
>             throws IOException, InterruptedException {
>         for (Text brand : brandNames) {
>             inputSet.add(brand.toString());
>         }
>         // perform clustering on the inputSet
>         clClustering = clClusterer.cluster(inputSet);
>
>         Iterator<Set<String>> itr = clClustering.iterator();
>         while (itr.hasNext()) {
>             Set<String> brandsSet = itr.next();
>             clusterBuilder.append("[");
>             for (String aBrand : brandsSet) {
>                 clusterBuilder.append(aBrand + ",");
>             }
>             clusterBuilder.append("]");
>         }
>         group.set(clusterBuilder.toString());
>         clusterBuilder = new StringBuilder();
>         context.write(key, group);
>     }
> }
>
>
>
> Thanks,
> -Ahmed
>

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com