It sounds like the HierarchicalClusterer (whatever that is) is doing what a collection of reducers should be doing. Try to restructure the job so that more of the clustering happens in the sort step, allowing the reducer to simply collect clusters. The cluster method needs to be rearchitected to lean more heavily on map-reduce.
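One way to read "do the clustering in the sort step" is to have the mapper emit a normalized blocking key, so that Hadoop's sort/shuffle groups candidate duplicates and each reducer only clusters a small bucket. A minimal sketch of such a key function (the `canonicalKey` normalization below is a hypothetical example, not something from the original code):

```java
import java.util.Locale;

// Sketch: a blocking key so the shuffle, not the reducer, does the coarse grouping.
// canonicalKey() is a hypothetical normalization; tune it to the brand data.
public class BlockingKey {
    // Lowercase, drop non-alphanumerics, keep a short prefix.
    // Brands that agree on this key land in the same reduce group.
    static String canonicalKey(String brand) {
        String norm = brand.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return norm.length() <= 4 ? norm : norm.substring(0, 4);
    }

    public static void main(String[] args) {
        // In the mapper you would emit something like:
        //   context.write(new Text(canonicalKey(brand)), new Text(brand));
        System.out.println(canonicalKey("Coca-Cola"));  // coca
        System.out.println(canonicalKey("COCA COLA"));  // coca
        System.out.println(canonicalKey("Pepsi"));      // peps
    }
}
```

Each reduce group is then small enough to pass to the expensive clusterer without holding the full data set in memory.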
On Mon, Jan 23, 2012 at 12:57 PM, Ahmed Abdeen Hamed <[email protected]> wrote:

> Thanks very much for the valuable tips! I made the changes that you
> pointed out. I am unclear on how to handle that many items all at once
> without putting them all in memory. I could split the file into a few
> files, which might help, but I could also end up splitting a group across
> two different files. To answer your question about how many elements I
> have in memory: there are 871671 items.
>
> Below is what the reduce() looks like after I followed your suggestions;
> it still ran out of memory. I would kindly appreciate a few more tips
> before I try splitting the files. It feels like that is against the
> spirit of Hadoop.
>
> public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
>     // Complete-Link Clusterer
>     HierarchicalClusterer<String> clClusterer =
>         new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
>
>     public void reduce(Text productID, Iterable<Text> brandNames, Context context)
>             throws IOException, InterruptedException {
>         Text key = new Text("1");
>         Set<Set<String>> clClustering = null;
>         Text group = new Text();
>         Set<String> inputSet = new HashSet<String>();
>         StringBuilder clusterBuilder = new StringBuilder();
>         for (Text brand : brandNames) {
>             inputSet.add(brand.toString());
>         }
>         // perform clustering on the inputSet
>         clClustering = clClusterer.cluster(inputSet);
>
>         Iterator<Set<String>> itr = clClustering.iterator();
>         while (itr.hasNext()) {
>             Set<String> brandsSet = itr.next();
>             clusterBuilder.append("[");
>             for (String aBrand : brandsSet) {
>                 clusterBuilder.append(aBrand + ",");
>             }
>             clusterBuilder.append("]");
>         }
>         group.set(clusterBuilder.toString());
>         clusterBuilder = new StringBuilder();
>         context.write(key, group);
>         inputSet = null;
>         clusterBuilder = null;
>     }
> }
>
> On Mon, Jan 23, 2012 at 3:41 PM, Steve Lewis <[email protected]> wrote:
>
>> In general, keeping the values you iterate through in memory in the
>> inputSet is a bad idea.
>> How many items do you have, and how large is inputSet when you finish?
>> You should make inputSet a local variable in the reduce method since you
>> are not using its contents later.
>> Also, with the published code that set will expand forever, since you do
>> not clear it after the reduce method, and that will surely run you out
>> of memory.

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
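If the full value list for one key genuinely cannot fit in memory, one pragmatic fallback (an approximation, not from the original code) is to cluster bounded-size chunks of the values and write each chunk's clusters immediately, so nothing accumulates across the whole group. A sketch of the chunking helper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: bound reducer memory by clustering fixed-size chunks instead of
// the whole value list. This approximates the global clustering: items in
// different chunks can never join the same cluster.
public class ChunkedReduce {
    static <T> List<List<T>> chunks(Iterable<T> values, int maxSize) {
        List<List<T>> out = new ArrayList<>();
        List<T> current = new ArrayList<>(maxSize);
        for (T v : values) {
            current.add(v);
            if (current.size() == maxSize) {
                out.add(current);
                current = new ArrayList<>(maxSize);
            }
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> c = chunks(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 4);
        System.out.println(c.size());   // 3
        System.out.println(c.get(2));   // [9, 10]
        // In the reducer: for each chunk, call clClusterer.cluster(new HashSet<>(chunk))
        // and context.write() each resulting cluster right away.
    }
}
```

Whether the approximation is acceptable depends on how often true duplicates would straddle a chunk boundary; a better blocking key in the mapper reduces that risk.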
