Hi all,
I have ensured that my mapper produces a unique key for every value it writes
and furthermore that each map() call only writes one value. I note here
that the value is a custom class for which I implement the Writable interface methods.
I realize that it isn't very real world to have (well, want) no combining done
prior to reducing, but I'm still getting my feet wet.
When the reducer runs, I expected to see one reduce() call for every map()
call, and I do. However, the value I get is a composite of the values from all
the reduce() calls that came before it.
So, for example, the mapper gets data like this :
ID, Name, Type, Other stuff...
A000, Cream, Group, ...
B231, Led Zeppelin, Group, ...
A044, Liberace, Individual, ...
ID is the external key from the source data and is guaranteed to be unique.
When I map it, I create a container for the row data and output that container
with all the data from that row only and use the ID field as a key.
Since the key is always unique I expected the sort/shuffle step to never
coalesce any two values. So I expected my reduce() method to be called once
per mapped input row, and it is.
The problem is, as each row is processed, the reducer sees a set of cumulative
value data instead of a container with a row of data in it. So the 'value'
parameter to reduce always has the information from previous reduce steps.
For example, given the data above :
1st Reducer Call :
Key = A000
Value =
Container :
(object 1) : Name = Cream, Type = Group, MBID = A000, ...
2nd Reducer Call :
Key = B231
Value =
Container :
(object 1) : Name = Led Zeppelin, Type = Group, MBID = B231, ...
(object 2) : Name = Cream, Type = Group, MBID = A000, ...
So the second reduce call has data in it from the first reduce call. Very
strange! At a guess I would say the reducer is re-using the value object when
it reads the values back from the mapping step. I dunno..
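To make that guess concrete, here is a minimal sketch of what I think might be
happening. It assumes the framework deserializes every record into the same
value instance, and that the container's readFields() appends to its list
without clearing it first. ReuseSketch, RowContainer and the clearOnRead flag
are made up for illustration (they are not my real classes), and the container
only mimics the two Writable methods with plain java.io so it runs without
Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public class ReuseSketch {

    // Hypothetical stand-in for the container: same two method shapes as
    // Hadoop's Writable, but no Hadoop dependency.
    static class RowContainer {
        final List<String> rows = new ArrayList<String>();
        final boolean clearOnRead;

        RowContainer(boolean clearOnRead) {
            this.clearOnRead = clearOnRead;
        }

        void write(DataOutput out) throws IOException {
            out.writeInt(rows.size());
            for (String row : rows) {
                out.writeUTF(row);
            }
        }

        void readFields(DataInput in) throws IOException {
            if (clearOnRead) {
                rows.clear(); // reset state left over from the previous record
            }
            int n = in.readInt();
            for (int i = 0; i < n; i++) {
                rows.add(in.readUTF());
            }
        }
    }

    // Serialize a container holding exactly one row, as each map() call would.
    static byte[] serialize(String row) {
        try {
            RowContainer c = new RowContainer(true);
            c.rows.add(row);
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            c.write(new DataOutputStream(bytes));
            return bytes.toByteArray();
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    // Read every record into ONE reused instance and record how many rows
    // the instance holds after each read.
    static List<Integer> sizesAfterEachRead(boolean clearOnRead, String... rows) {
        try {
            RowContainer reused = new RowContainer(clearOnRead);
            List<Integer> sizes = new ArrayList<Integer>();
            for (String row : rows) {
                byte[] record = serialize(row);
                reused.readFields(new DataInputStream(new ByteArrayInputStream(record)));
                sizes.add(reused.rows.size());
            }
            return sizes;
        } catch (IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Without clearing, the reused instance accumulates rows.
        System.out.println(sizesAfterEachRead(false, "Cream", "Led Zeppelin", "Liberace"));
        // With clearing, each read leaves exactly one row.
        System.out.println(sizesAfterEachRead(true, "Cream", "Led Zeppelin", "Liberace"));
    }
}
```

If the guess is right, the first line printed grows the same way my reducer
values do, and clearing the list at the top of readFields() would be the thing
to try in the real class.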
If anyone has any ideas, I'm open to suggestions. I'm running Hadoop 0.20.2-cdh3u0.
Thanks!
R