The reason this is so rare is that map/reduce tasks are, by nature, orthogonal: word count, batch image recognition, terasort -- all the things Hadoop is famous for -- are largely independent tasks. It's much rarer (I think) to see people using Hadoop for traffic simulations or protein folding, because those tasks require continuous signal integration between nodes.
1) First, consider rewriting the job so that all communication is replaced by state variables in a reducer, and choose your keys wisely, so that all "communication" between machines is obviated by the fact that a single reducer receives all of the information relevant to its task.

2) If a small amount of state needs to be preserved or cached in real time, to optimize the situation where two machines shouldn't redo the same task (i.e. invoking a web service to fetch a piece of data, or some other operation that needs to be rate limited and not duplicated), then you can use a fast key-value store (like you suggested), such as the ones provided by Basho (http://basho.com/) or Amazon (Dynamo).

3) If you really need a lot of message passing, then you might be better off using an inherently more integrated tool like GridGain, which allows for sophisticated message passing between asynchronously running processes, e.g. http://gridgaintech.wordpress.com/2011/01/26/distributed-actors-in-gridgain/. There may not be a reliable way to implement a sophisticated message passing architecture in Hadoop itself: the system is highly dynamic and built for rapid streaming reads/writes, which significant communication overhead would stifle.
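To make point 1 concrete, here's a minimal Hadoop Streaming-style reducer sketch in Python (the word-count reducer is my illustrative choice, not from the question). Because the framework sorts by key before the reduce phase, all values for a key arrive contiguously at one reducer, so the "communication" is just a couple of local state variables:

```python
import sys

def reduce_stream(lines):
    """Streaming-style reducer: input lines are 'key\\tvalue', sorted by key,
    so all values for a key arrive contiguously at this one process and
    per-key state lives in plain local variables -- no cross-machine chatter."""
    current_key, count = None, 0
    for line in lines:
        key, _, value = line.rstrip("\n").partition("\t")
        if key != current_key:
            if current_key is not None:
                yield (current_key, count)   # key changed: emit the finished total
            current_key, count = key, 0
        count += int(value)
    if current_key is not None:
        yield (current_key, count)           # flush the last key

if __name__ == "__main__":
    # Hadoop Streaming feeds sorted mapper output on stdin.
    for key, total in reduce_stream(sys.stdin):
        print(f"{key}\t{total}")
```

The design point is that key choice *is* the communication plan: anything that must be combined goes under the same key.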
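For point 2, the pattern is "check the shared store before doing the expensive work". The sketch below is a rough illustration only: a plain dict stands in for the remote key-value store (a real Riak/Dynamo client would have its own get/put API), and `fetch` stands in for the rate-limited web-service call you don't want duplicated:

```python
import threading

class DedupCache:
    """Check-before-compute cache: look a key up in the (shared) store first,
    and only invoke the expensive fetch on a miss. The dict is a local
    stand-in for a networked key-value store."""

    def __init__(self, fetch):
        self._store = {}              # stand-in for the remote KV store
        self._lock = threading.Lock()
        self._fetch = fetch           # the expensive, rate-limited call
        self.misses = 0               # how many times we actually called fetch

    def get(self, key):
        with self._lock:
            if key in self._store:
                return self._store[key]
        value = self._fetch(key)      # cache miss: do the real work once
        self.misses += 1
        with self._lock:
            # setdefault keeps the first writer's value if we raced another task
            self._store.setdefault(key, value)
            return self._store[key]
```

With a real distributed store you'd get the same shape but with a network round trip instead of a dict lookup, and you'd likely add a TTL so stale entries expire.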
