In general, keeping the values you iterate through in memory in inputSet is a bad idea. How many items do you have, and how large is inputSet when you finish? You should make inputSet a local variable in the reduce method, since you are not using its contents later. Also, with the published code that set will grow forever, since you never clear it after the reduce method finishes, and that will surely run you out of memory.
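To illustrate the point about scoping, here is a minimal plain-Java sketch. It does not use the Hadoop API: LeakyReducer and ScopedReducer are hypothetical stand-ins for the posted reducer, and each reduce call simply returns the size of the set it ended up holding. The field version carries values over from one call to the next, while the local version starts empty every time.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

public class LocalSetSketch {

    // Hypothetical stand-in for the posted reducer: the set is a field,
    // so values from every earlier reduce() call stay in memory.
    static class LeakyReducer {
        private final Set<String> inputSet = new HashSet<String>();

        int reduce(Iterable<String> brandNames) {
            for (String brand : brandNames) {
                inputSet.add(brand);
            }
            return inputSet.size(); // includes leftovers from earlier keys
        }
    }

    // Suggested shape: the set is local, so it becomes garbage
    // as soon as reduce() returns.
    static class ScopedReducer {
        int reduce(Iterable<String> brandNames) {
            Set<String> inputSet = new HashSet<String>();
            for (String brand : brandNames) {
                inputSet.add(brand);
            }
            return inputSet.size(); // only the values for this key
        }
    }

    public static void main(String[] args) {
        LeakyReducer leaky = new LeakyReducer();
        ScopedReducer scoped = new ScopedReducer();

        // Two simulated reduce() calls, one per key.
        System.out.println(leaky.reduce(Arrays.asList("acme", "acme inc")));  // 2
        System.out.println(leaky.reduce(Arrays.asList("zenith")));            // 3
        System.out.println(scoped.reduce(Arrays.asList("acme", "acme inc"))); // 2
        System.out.println(scoped.reduce(Arrays.asList("zenith")));           // 1
    }
}
```

Calling inputSet.clear() at the top of reduce would have the same effect on the field version. Either way, this only stops growth across keys; a single key with a very large number of values would still have to fit in memory.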
On Mon, Jan 23, 2012 at 12:29 PM, Ahmed Abdeen Hamed <[email protected]> wrote:

> Hello friends,
>
> I wrote a reduce() that receives a large dataset as a text values from the
> map(). The purpose of the reduce() is to compute the distance between each
> item in the values text. When I do, I run out of memory. I tried to
> increase the heap size but that didn't scale either. I am wondering if
> there is a way that I can distribute the reduce() to get it to scale. If
> this is possible, can you kindly share your idea?
> Please note, it is crucial for the values to be passed together in the
> fashion that I am doing, so they can be clustered into groups.
>
> Here is what the reduce() looks like:
>
> public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
>     Text key = new Text("1");
>
>     Set<String> inputSet = new HashSet<String>();
>     StringBuilder clusterBuilder = new StringBuilder();
>     Set<Set<String>> clClustering = null;
>     Text group = new Text();
>
>     // Complete-Link Clusterer
>     HierarchicalClusterer<String> clClusterer = new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
>     String[] brandsList = null;
>
>     public void reduce(Text productID, Iterable<Text> brandNames, Context context) throws IOException, InterruptedException {
>         for(Text brand: brandNames){
>             inputSet.add(brand.toString());
>         }
>         // perform clustering on the inputSet
>         clClustering = clClusterer.cluster(inputSet);
>
>         Iterator<Set<String>> itr = clClustering.iterator();
>         while(itr.hasNext()){
>
>             Set<String> brandsSet = itr.next();
>             clusterBuilder.append("[");
>             for(String aBrand: brandsSet){
>                 clusterBuilder.append(aBrand + ",");
>             }
>             clusterBuilder.append("]");
>         }
>         group.set(clusterBuilder.toString());
>         clusterBuilder = new StringBuilder();
>         context.write(key, group);
>
>     }
> }
>
> Thanks,
> -Ahmed
>

--
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
