-
distributing a time consuming single reduce task
Ahmed Abdeen Hamed 2012-01-23, 20:29
Hello friends,
I wrote a reduce() that receives a large dataset as Text values from the map(). The purpose of the reduce() is to compute the distance between each pair of items in the values. When I do, I run out of memory. I tried to increase the heap size, but that didn't scale either. I am wondering if there is a way that I can distribute the reduce() to get it to scale. If this is possible, can you kindly share your idea? Please note, it is crucial for the values to be passed together in the fashion that I am doing, so they can be clustered into groups.
Here is what the reduce() looks like:
public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
    Text key = new Text("1");

    Set<String> inputSet = new HashSet<String>();
    StringBuilder clusterBuilder = new StringBuilder();
    Set<Set<String>> clClustering = null;
    Text group = new Text();

    // Complete-Link Clusterer
    HierarchicalClusterer<String> clClusterer =
        new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
    String[] brandsList = null;

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }
        // perform clustering on the inputSet
        clClustering = clClusterer.cluster(inputSet);

        Iterator<Set<String>> itr = clClustering.iterator();
        while (itr.hasNext()) {
            Set<String> brandsSet = itr.next();
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand + ",");
            }
            clusterBuilder.append("]");
        }
        group.set(clusterBuilder.toString());
        clusterBuilder = new StringBuilder();
        context.write(key, group);
    }
}
Thanks, -Ahmed
-
Re: distributing a time consuming single reduce task
Steve Lewis 2012-01-23, 20:41
In general, keeping the values you iterate through in memory in the inputSet is a bad idea. How many items do you have, and how large is inputSet when you finish? You should make inputSet a local variable in the reduce method, since you are not using its contents later. Also, with the published code that set will expand forever, since you do not clear it after the reduce method, and that will surely run you out of memory.
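The growth described above can be reproduced without Hadoop. The following is a minimal sketch, not the poster's reducer: `LeakDemo`, `reduceWithField`, and `reduceWithLocal` are hypothetical names, and the brand strings are made up. It only shows that a set held in a field keeps the values from earlier calls, while a local set does not.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a reducer object; no Hadoop classes are used.
class LeakDemo {
    // Field-level set: it survives across calls, like inputSet in the posted code.
    private final Set<String> sharedSet = new HashSet<>();

    // Mimics the posted reduce(): adds to the field and never clears it.
    int reduceWithField(List<String> values) {
        sharedSet.addAll(values);
        return sharedSet.size();
    }

    // Mimics the suggested fix: the set is local and unreachable after the call.
    int reduceWithLocal(List<String> values) {
        Set<String> localSet = new HashSet<>(values);
        return localSet.size();
    }

    public static void main(String[] args) {
        LeakDemo demo = new LeakDemo();
        List<String> firstKey = Arrays.asList("acme", "acmee");
        List<String> secondKey = Arrays.asList("globex", "globx", "glbex");
        System.out.println(demo.reduceWithField(firstKey));  // 2
        System.out.println(demo.reduceWithField(secondKey)); // 5, first key's values are still held
        System.out.println(demo.reduceWithLocal(secondKey)); // 3
    }
}
```

The same reducer instance handles every key assigned to its task, so the field-level set also mixes values from different keys into one clustering call.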
-- Steven M. Lewis PhD 4221 105th Ave NE Kirkland, WA 98033 206-384-1340 (cell) Skype lordjoe_com
-
Re: distributing a time consuming single reduce task
Ahmed Abdeen Hamed 2012-01-23, 20:57
Thanks very much for the valuable tips! I made the changes that you pointed out. I am unclear on how to handle that many items all at once without putting them all in memory. I can split the file into a few files, which could be helpful, but I could also end up splitting a group across two different files. To answer your question about how many elements I have in memory, there are 871671 items.
Below is what the reduce() looks like after I followed your suggestions; it still ran out of memory. I would kindly appreciate a few more tips before I try splitting the files. Splitting feels like it is against the spirit of Hadoop.
public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
    // Complete-Link Clusterer
    HierarchicalClusterer<String> clClusterer =
        new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        Text key = new Text("1");
        Set<Set<String>> clClustering = null;
        Text group = new Text();
        Set<String> inputSet = new HashSet<String>();
        StringBuilder clusterBuilder = new StringBuilder();
        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }
        // perform clustering on the inputSet
        clClustering = clClusterer.cluster(inputSet);

        Iterator<Set<String>> itr = clClustering.iterator();
        while (itr.hasNext()) {
            Set<String> brandsSet = itr.next();
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand + ",");
            }
            clusterBuilder.append("]");
        }
        group.set(clusterBuilder.toString());
        clusterBuilder = new StringBuilder();
        context.write(key, group);
        inputSet = null;
        clusterBuilder = null;
    }
}
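A rough count may explain why a larger heap did not help with 871671 items. The sketch below assumes the clusterer computes or stores one distance per unordered pair, which the thread does not confirm since the implementation is a black box. The 8 bytes per distance (one double) is also an assumption, and `PairwiseCost` is a hypothetical name.

```java
// Back-of-the-envelope estimate only; 8 bytes per stored distance is an assumption.
class PairwiseCost {
    // Number of unordered pairs among n items: n * (n - 1) / 2.
    static long pairCount(long n) {
        return n * (n - 1) / 2;
    }

    // Bytes needed to hold one 8-byte double per pair.
    static long bytesForDistances(long n) {
        return pairCount(n) * 8L;
    }

    public static void main(String[] args) {
        long n = 871_671L; // item count reported in the thread
        System.out.println(pairCount(n));         // 379904730285
        System.out.println(bytesForDistances(n)); // 3039237842280, roughly 3 TB
    }
}
```

Under those assumptions the pairwise work grows quadratically with the number of items, so no realistic single-JVM heap setting would cover it.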
-
Re: distributing a time consuming single reduce task
Steve Lewis 2012-01-24, 02:09
It sounds like the HierarchicalClusterer, whatever that is, is doing what a collection of reducers should be doing. Try to restructure the job so that more of the clustering is done in the sort step, allowing the reducer to simply collect clusters. The cluster method needs to be rearchitected to lean more heavily on map-reduce.
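One possible reading of this advice is "blocking": the map step emits each brand under a coarse key, the shuffle groups brands by that key, and each reduce call clusters only its own small block. This is a swapped-in illustration, not something proposed in the thread. The sketch simulates the grouping in plain Java without Hadoop, and `BlockingSketch`, `blockingKey`, and the first-character rule are all hypothetical choices.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

class BlockingSketch {
    // Hypothetical blocking key: the lower-cased first character of the brand name.
    static String blockingKey(String brand) {
        String trimmed = brand.trim().toLowerCase();
        return trimmed.isEmpty() ? "" : trimmed.substring(0, 1);
    }

    // Simulates map + shuffle: each brand is emitted under its key, then grouped by key.
    static Map<String, List<String>> groupByBlock(List<String> brands) {
        Map<String, List<String>> blocks = new TreeMap<>();
        for (String brand : brands) {
            blocks.computeIfAbsent(blockingKey(brand), k -> new ArrayList<>()).add(brand);
        }
        return blocks;
    }

    public static void main(String[] args) {
        List<String> brands = Arrays.asList("Sony", "Sonny", "Samsung", "Nokia", "Nokai");
        // Each entry stands for one reduce() call that would cluster only its own block.
        groupByBlock(brands).forEach((key, block) -> System.out.println(key + " -> " + block));
    }
}
```

The trade-off is that two spellings of one brand that differ in the first character land in different blocks and are never compared, so the key has to be chosen with the data and the distance threshold in mind.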
-
Re: distributing a time consuming single reduce task
Ahmed Abdeen Hamed 2012-01-24, 02:49
Thanks very much Steve!
The clustering part of the code is really a black box, and there isn't much to do as far as restructuring. I ended up breaking the big input file into smaller ones, and I am letting it run on the cluster. I will know in the morning whether it succeeded or not. But I will consider using Mahout for clustering, since it is built to work with MapReduce. I will let you know how that goes if you are interested.
Thanks very much once again for your kind responses! -Ahmed