I keep getting multiple values for unique reduce keys

Rick Ross Sun, 04 Sep 2011 17:42:10 -0700

Hi all, 

I have ensured that my mapper produces a unique key for every value it writes,
and furthermore that each map() call writes only one value. I note here that
the value is a custom class for which I implement the Writable interface
methods.


I realize that it isn't very real world to have (well, want) no combining done 
prior to reducing, but I'm still getting my feet wet.  

When the reducer runs, I expected to see one reduce() call for every map() 
call, and I do. However, the value I get is the composite of the values from 
all the reduce() calls that came before it.

So, for example, the mapper gets data like this :

   ID,     Name,          Type,          Other stuff...
   A000,   Cream,         Group,         ...
   B231,   Led Zeppelin,  Group,         ...
   A044,   Liberace,      Individual,    ...


ID is the external key from the source data and is guaranteed to be unique.

When I map it, I create a container for the row data and output that container 
with all the data from that row only and use the ID field as a key.

Since the key is always unique I expected the sort/shuffle step to never 
coalesce any two values.    So I expected my reduce() method to be called once 
per mapped input row, and it is.    

The problem is, as each row is processed, the reducer sees a set of cumulative 
value data instead of a container with a row of data in it.  So the 'value' 
parameter to reduce always has the information from previous reduce steps.  

For example, given the data above : 

1st Reducer Call : 
   Key = A000 
   Value = 
       Container : 
          (object 1) : Name = Cream, Type = Group, MBID = A000, ... 

2nd Reducer Call : 
   Key = B231 
   Value = 
       Container : 
          (object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ... 
          (object 2) : Name = Cream, Type = Group, MBID = A000, ... 

So the second reduce call has data in it from the first reduce call.   Very 
strange!   At a guess I would say the reducer is re-using the object when it 
reads the objects back from the mapping step.  I dunno.. 
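
A minimal sketch of how such reuse could produce this symptom, assuming the
container keeps its rows in a list and that the framework deserializes each
value into the same instance. RowContainer, serialize, and load are
hypothetical names used for illustration; the class only mirrors the
write/readFields shape of Hadoop's Writable and does not depend on Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for the custom value type (an assumption, not the
// poster's class). It mirrors Writable's two methods without using Hadoop.
class RowContainer {
    private final List<String> rows = new ArrayList<String>();

    void add(String row) { rows.add(row); }

    List<String> rows() { return rows; }

    // Same shape as Writable.write(DataOutput): count, then each row.
    void write(DataOutput out) throws IOException {
        out.writeInt(rows.size());
        for (String row : rows) {
            out.writeUTF(row);
        }
    }

    // Same shape as Writable.readFields(DataInput).
    void readFields(DataInput in) throws IOException {
        rows.clear(); // reset state; without this, rows from earlier reads pile up
        int n = in.readInt();
        for (int i = 0; i < n; i++) {
            rows.add(in.readUTF());
        }
    }

    // Helper: serialize a container holding a single row.
    static byte[] serialize(String row) {
        RowContainer c = new RowContainer();
        c.add(row);
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try {
            c.write(new DataOutputStream(buf));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
        return buf.toByteArray();
    }

    // Helper: deserialize into an existing instance, the way a framework
    // that reuses value objects would.
    static void load(RowContainer target, byte[] bytes) {
        try {
            target.readFields(new DataInputStream(new ByteArrayInputStream(bytes)));
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        RowContainer reused = new RowContainer(); // one instance for every value
        String[] input = {"A000,Cream,Group", "B231,Led Zeppelin,Group"};
        for (String row : input) {
            load(reused, serialize(row));
            System.out.println(reused.rows());
        }
    }
}
```

With the clear() call in place, the two reads print [A000,Cream,Group] and
then [B231,Led Zeppelin,Group]. Removing that call makes the second read also
carry the first row, which resembles the cumulative values described above.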

If anyone has any ideas, I'm open to suggestions. I'm running Hadoop
0.20.2-cdh3u0.

Thanks!

R
