It sounds like the HierarchicalClusterer (whatever that is) is doing what a collection of reducers should be doing. Try to restructure the job so that more of the clustering happens in the sort step, allowing the reducer to simply collect clusters. The cluster method needs to be rearchitected to lean more heavily on map-reduce.
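One way to read "do the clustering in the sort step" is to have the mapper emit a normalized blocking key, so that Hadoop's sort/shuffle groups candidate duplicates and each reducer only clusters a small bucket. A minimal sketch of such a key function (the `canonicalKey` normalization below is a hypothetical example, not something from the original code):

```java
import java.util.Locale;

// Sketch: a blocking key so the shuffle, not the reducer, does the coarse grouping.
// canonicalKey() is a hypothetical normalization; tune it to the brand data.
public class BlockingKey {
    // Lowercase, drop non-alphanumerics, keep a short prefix.
    // Brands that agree on this key land in the same reduce group.
    static String canonicalKey(String brand) {
        String norm = brand.toLowerCase(Locale.ROOT).replaceAll("[^a-z0-9]", "");
        return norm.length() <= 4 ? norm : norm.substring(0, 4);
    }

    public static void main(String[] args) {
        // In the mapper you would emit something like:
        //   context.write(new Text(canonicalKey(brand)), new Text(brand));
        System.out.println(canonicalKey("Coca-Cola"));  // coca
        System.out.println(canonicalKey("COCA COLA"));  // coca
        System.out.println(canonicalKey("Pepsi"));      // peps
    }
}
```

Each reduce group is then small enough to pass to the expensive clusterer without holding the full data set in memory.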
On Mon, Jan 23, 2012 at 12:57 PM, Ahmed Abdeen Hamed <[email protected]> wrote:

> Thanks very much for the valuable tips! I made the changes that you
> pointed out. I am unclear on how to handle that many items all at once
> without putting them all in memory. I could split the file into a few
> files, which might help, but I could also end up splitting a group across
> two different files. To answer your question about how many elements I
> have in memory: there are 871671 items.
>
> Below is what the reduce() looks like after I followed your suggestions;
> it still ran out of memory. I would kindly appreciate a few more tips
> before I try splitting the files. It feels like that is against the
> spirit of Hadoop.
>
> public static class BrandClusteringReducer extends Reducer<Text, Text, Text, Text> {
>     // Complete-Link Clusterer
>     HierarchicalClusterer<String> clClusterer =
>         new CompleteLinkClusterer<String>(MAX_DISTANCE, EDIT_DISTANCE);
>
>     public void reduce(Text productID, Iterable<Text> brandNames, Context context)
>             throws IOException, InterruptedException {
>         Text key = new Text("1");
>         Set<Set<String>> clClustering = null;
>         Text group = new Text();
>         Set<String> inputSet = new HashSet<String>();
>         StringBuilder clusterBuilder = new StringBuilder();
>         for (Text brand : brandNames) {
>             inputSet.add(brand.toString());
>         }
>         // perform clustering on the inputSet
>         clClustering = clClusterer.cluster(inputSet);
>
>         Iterator<Set<String>> itr = clClustering.iterator();
>         while (itr.hasNext()) {
>             Set<String> brandsSet = itr.next();
>             clusterBuilder.append("[");
>             for (String aBrand : brandsSet) {
>                 clusterBuilder.append(aBrand + ",");
>             }
>             clusterBuilder.append("]");
>         }
>         group.set(clusterBuilder.toString());
>         clusterBuilder = new StringBuilder();
>         context.write(key, group);
>         inputSet = null;
>         clusterBuilder = null;
>     }
> }
>
> On Mon, Jan 23, 2012 at 3:41 PM, Steve Lewis <[email protected]> wrote:
>
>> In general, keeping the values you iterate through in memory in the
>> inputSet is a bad idea.
>> How many items do you have, and how large is inputSet when you finish?
>> You should make inputSet a local variable in the reduce method since you
>> are not using its contents later.
>> Also, with the published code that set will expand forever, since you do
>> not clear it after the reduce method, and that will surely run you out
>> of memory.

-- 
Steven M. Lewis PhD
4221 105th Ave NE
Kirkland, WA 98033
206-384-1340 (cell)
Skype lordjoe_com
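If the full value list for one key genuinely cannot fit in memory, one pragmatic fallback (an approximation, not from the original code) is to cluster bounded-size chunks of the values and write each chunk's clusters immediately, so nothing accumulates across the whole group. A sketch of the chunking helper:

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Sketch: bound reducer memory by clustering fixed-size chunks instead of
// the whole value list. This approximates the global clustering: items in
// different chunks can never join the same cluster.
public class ChunkedReduce {
    static <T> List<List<T>> chunks(Iterable<T> values, int maxSize) {
        List<List<T>> out = new ArrayList<>();
        List<T> current = new ArrayList<>(maxSize);
        for (T v : values) {
            current.add(v);
            if (current.size() == maxSize) {
                out.add(current);
                current = new ArrayList<>(maxSize);
            }
        }
        if (!current.isEmpty()) out.add(current);
        return out;
    }

    public static void main(String[] args) {
        List<List<Integer>> c = chunks(Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10), 4);
        System.out.println(c.size());   // 3
        System.out.println(c.get(2));   // [9, 10]
        // In the reducer: for each chunk, call clClusterer.cluster(new HashSet<>(chunk))
        // and context.write() each resulting cluster right away.
    }
}
```

Whether the approximation is acceptable depends on how often true duplicates would straddle a chunk boundary; a better blocking key in the mapper reduces that risk.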
