Re: distributing a time consuming single reduce task
It sounds like the HierarchicalClusterer (whatever that is) is doing the work
that a collection of reducers should be doing. Try to restructure the job so
that the clustering happens more in the sort step, allowing the reducer to
simply collect clusters; the cluster method needs to be rearchitected to lean
more heavily on map-reduce.
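
As a minimal sketch of that restructuring (not the code from this thread), the
mapper below emits a coarse "blocking" key for each brand name, so that Hadoop's
sort/shuffle groups likely cluster members together and each reduce() call only
has to cluster one small block instead of a single reducer holding everything.
The one-brand-name-per-line input and the lower-cased three-character prefix
used as the blocking key are assumptions for illustration only.

// Sketch only: push the grouping work into the shuffle by keying each brand
// name with a coarse blocking key; names sharing a key meet in one reduce call.
import java.io.IOException;

import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class BrandBlockingMapper extends Mapper<LongWritable, Text, Text, Text> {

    private final Text blockKey = new Text();
    private final Text brand = new Text();

    @Override
    protected void map(LongWritable offset, Text line, Context context)
            throws IOException, InterruptedException {
        String name = line.toString().trim();
        if (name.isEmpty()) {
            return;
        }
        // Hypothetical blocking function: a normalized prefix. Anything that
        // keeps likely cluster members under the same key would do.
        blockKey.set(name.toLowerCase().substring(0, Math.min(3, name.length())));
        brand.set(name);
        context.write(blockKey, brand);
    }
}

With a keying scheme like this, the job can run with many reduce tasks, each
clustering only its own block rather than the whole data set.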

On Mon, Jan 23, 2012 at 12:57 PM, Ahmed Abdeen Hamed <
[EMAIL PROTECTED]> wrote:

> Thanks very much for the valuable tips! I made the changes that you
> pointed out. I am unclear on how to handle that many items all at once
> without putting them all in memory. I could split the file into a few
> smaller files, which might help, but I could also end up splitting a group
> across two different files. To answer your question about how many
> elements I have in memory, there are 871,671 items.
>
> Below is what the reduce() method looks like after I followed your
> suggestions; it still ran out of memory. I would appreciate a few more
> tips before I try splitting the files, since that feels like it goes
> against the spirit of Hadoop.
>
> public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
>
>     // Complete-Link Clusterer
>     HierarchicalClusterer<String> clClusterer =
>             new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
>
>     public void reduce(Text productID, Iterable<Text> brandNames, Context context)
>             throws IOException, InterruptedException {
>         Text key = new Text("1");
>         Set<Set<String>> clClustering = null;
>         Text group = new Text();
>         Set<String> inputSet = new HashSet<String>();
>         StringBuilder clusterBuilder = new StringBuilder();
>
>         for (Text brand : brandNames) {
>             inputSet.add(brand.toString());
>         }
>
>         // perform clustering on the inputSet
>         clClustering = clClusterer.cluster(inputSet);
>
>         Iterator<Set<String>> itr = clClustering.iterator();
>         while (itr.hasNext()) {
>             Set<String> brandsSet = itr.next();
>             clusterBuilder.append("[");
>             for (String aBrand : brandsSet) {
>                 clusterBuilder.append(aBrand + ",");
>             }
>             clusterBuilder.append("]");
>         }
>         group.set(clusterBuilder.toString());
>         clusterBuilder = new StringBuilder();
>         context.write(key, group);
>         inputSet = null;
>         clusterBuilder = null;
>     }
> }
>
>
>
>
>
>
> On Mon, Jan 23, 2012 at 3:41 PM, Steve Lewis <[EMAIL PROTECTED]> wrote:
>
>> In general, keeping the values you iterate through in memory in the
>> inputSet is a bad idea.
>> How many items do you have, and how large is inputSet when you finish?
>> You should make inputSet a local variable in the reduce method, since you
>> are not using its contents later.
>> Also, with the published code that set will expand forever, since you do
>> not clear it after the reduce method, and that will surely run you out of
>> memory.
>>
>
--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
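
For reference, here is a minimal sketch of the memory pattern suggested above,
reusing the HierarchicalClusterer / CompleteLinkClusterer and the MAX_DISTANCE /
EDIT_DISTANCE constants from the posted code: the working set stays local to
reduce() so it can be garbage-collected after each group, and one output record
is written per cluster instead of building a single large concatenated string.

// Sketch only: same clusterer and constants as the post above; only the memory
// handling and output pattern differ.
public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {

    private final HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    @Override
    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        // Local to this call, so it is eligible for garbage collection as soon
        // as the call returns; the reducer's footprint is bounded by the
        // largest single group rather than by the whole run.
        Set<String> inputSet = new HashSet<String>();
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }

        // Emit one record per cluster instead of concatenating every cluster
        // into a single huge value.
        Text group = new Text();
        for (Set<String> cluster : clClusterer.cluster(inputSet)) {
            group.set(cluster.toString());
            context.write(productID, group);
        }
    }
}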