Thanks very much for the valuable tips! I made the changes you pointed
out. I am still unclear on how to handle that many items at once without
putting them all in memory. I could split the input into a few files, which
might help, but I could also end up splitting a group across two different
files. To answer your question about how many elements I have in memory:
there are 871,671 items.

Below is what the reduce() looks like after I followed your suggestions;
it still ran out of memory. I would appreciate a few more tips before I
resort to splitting the files, since that feels like it goes against the
spirit of Hadoop.

public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {

    // Complete-Link Clusterer
    HierarchicalClusterer<String> clClusterer =
            new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        Text key = new Text("1");
        Set<Set<String>> clClustering = null;
        Text group = new Text();
        Set<String> inputSet = new HashSet<String>();
        StringBuilder clusterBuilder = new StringBuilder();

        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }

        // perform clustering on the inputSet
        clClustering = clClusterer.cluster(inputSet);

        Iterator<Set<String>> itr = clClustering.iterator();
        while (itr.hasNext()) {
            Set<String> brandsSet = itr.next();
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand + ",");
            }
            clusterBuilder.append("]");
        }

        group.set(clusterBuilder.toString());
        clusterBuilder = new StringBuilder();
        context.write(key, group);
        inputSet = null;
        clusterBuilder = null;
    }
}
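As a side note, the formatting loop above leaves a trailing comma inside each bracket and concatenates every cluster into one StringBuilder before a single write. A minimal plain-Java sketch of formatting each cluster without the trailing comma (outside Hadoop, so the Reducer and Context types are omitted; the class and method names here are just illustrative):

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.Set;

public class ClusterFormat {

    // Format one cluster as "[a,b,c]" with no trailing comma.
    static String formatCluster(Set<String> cluster) {
        return "[" + String.join(",", cluster) + "]";
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order, so the output is deterministic.
        Set<String> cluster = new LinkedHashSet<String>(
                Arrays.asList("acme", "acme inc", "acme co"));

        // In the reducer, each cluster could then be written as its own
        // record, e.g. context.write(key, new Text(formatCluster(brandsSet))),
        // instead of building one giant value for all clusters.
        System.out.println(formatCluster(cluster));  // prints [acme,acme inc,acme co]
    }
}
```

Emitting one record per cluster keeps only a single cluster's string in memory at a time, though the inputSet and the clusterer itself remain the dominant memory cost.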

On Mon, Jan 23, 2012 at 3:41 PM, Steve Lewis <[email protected]> wrote:

> In general, keeping the values you iterate through in memory in the
> inputSet is a bad idea.
> How many items do you have, and how large is inputSet when you finish?
> You should make inputSet a local variable in the reduce method since you
> are not using its contents later.
> Also, with the published code that set will expand forever, since you do
> not clear it after the reduce method, and that will surely run you out of
> memory.
>
