Thanks very much for the valuable tips! I made the changes you pointed
out. I am still unclear on how to handle that many items at once without
keeping them all in memory. I could split the input into a few smaller
files, but then a group might end up split across two different files. To
answer your question about how many elements I have in memory: there are
871,671 items.
Below is what the reduce() method looks like after I followed your
suggestions; it still runs out of memory. I would appreciate a few more
tips before I resort to splitting the files, since that feels like it goes
against the spirit of Hadoop.
public static class BrandClusteringReducer extends Reducer&lt;Text, Text, Text, Text> {

    // Complete-link clusterer
    HierarchicalClusterer<String> clClusterer =
        new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);

    public void reduce(Text productID, Iterable<Text> brandNames, Context context)
            throws IOException, InterruptedException {
        Text key = new Text("1");
        Text group = new Text();
        Set<String> inputSet = new HashSet<String>();
        StringBuilder clusterBuilder = new StringBuilder();

        for (Text brand : brandNames) {
            inputSet.add(brand.toString());
        }

        // perform clustering on the inputSet
        Set<Set<String>> clClustering = clClusterer.cluster(inputSet);
        Iterator<Set<String>> itr = clClustering.iterator();
        while (itr.hasNext()) {
            Set<String> brandsSet = itr.next();
            clusterBuilder.append("[");
            for (String aBrand : brandsSet) {
                clusterBuilder.append(aBrand).append(",");
            }
            clusterBuilder.append("]");
        }

        group.set(clusterBuilder.toString());
        context.write(key, group);
        inputSet = null;
        clusterBuilder = null;
    }
}
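One idea I am considering, in case it helps the discussion (this is only a sketch, not something I have run inside the job yet): instead of appending every cluster into one large StringBuilder and writing a single huge record, I could format and emit each cluster as its own output record inside the while loop, so no buffer ever holds more than one cluster. A small helper like the one below (the name formatCluster is mine) would let the reducer call context.write(key, new Text(formatCluster(brandsSet))) per cluster:

```java
import java.util.LinkedHashSet;
import java.util.Set;

public class ClusterFormatter {

    // Format a single cluster as "[a,b,c]". Emitting one record per
    // cluster keeps the reducer's buffer bounded by the largest single
    // cluster rather than by the whole clustering result.
    public static String formatCluster(Set<String> cluster) {
        StringBuilder sb = new StringBuilder("[");
        boolean first = true;
        for (String brand : cluster) {
            if (!first) {
                sb.append(",");
            }
            sb.append(brand);
            first = false;
        }
        return sb.append("]").toString();
    }

    public static void main(String[] args) {
        // LinkedHashSet keeps insertion order, so the output is deterministic
        Set<String> cluster = new LinkedHashSet<String>();
        cluster.add("acme");
        cluster.add("acmee");
        System.out.println(formatCluster(cluster)); // prints [acme,acmee]
    }
}
```

This only shrinks the output buffer, of course; the inputSet and the clusterer's internal state would still hold all the brands.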
On Mon, Jan 23, 2012 at 3:41 PM, Steve Lewis <[email protected]> wrote:
> In general, keeping the values you iterate through in memory in the
> inputSet is a bad idea.
> How many items do you have, and how large is inputSet when you finish?
> You should make inputSet a local variable in the reduce method since you
> are not using its contents later.
> Also, with the published code that set will expand forever, since you do
> not clear it after the reduce method, and that will surely run you out
> of memory.
>
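P.S. If those 871,671 brands really do have to be clustered in a single reduce call, one stopgap I could try first is raising the child JVM heap in mapred-site.xml (assuming the Hadoop 1.x-style property name; the default is quite small):

```xml
<property>
  <name>mapred.child.java.opts</name>
  <value>-Xmx2048m</value>
</property>
```

That would only buy headroom rather than fix the scaling issue, so I would still welcome ideas on restructuring the job.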