Hi,

On Thu, Mar 1, 2012 at 00:07, Robert Evans <[email protected]> wrote:

> Sorry it has taken me so long to respond. Today has been a very crazy
> day.

No worries.

> I am just guessing what your algorithm is for auto-complete.

What we have has a lot more features, yet the basic idea is similar
enough to what you describe for this discussion.

> If we want the keys to come out in sorted order, we need to have a
> sequence file with the partition keys for the total order partitioner.
> TeraSort generates a partition file by getting ....
> This only really works for TeraSort because it assumes that all of the
> partitions are more or less random already.

And that is something I don't have.

> This is the case for the output of a typical map/reduce job where the
> reduce does not change the keys passed in and the output of the reducer
> is less than a block in size. That sure sounds like what wordcount does
> to me. The only real way to get around that is to do it as part of a
> map/reduce job, and do some random sampling instead of reading the
> first N. It should be a map/reduce job because it is going to be
> reading a lot more data than TeraSort's partition generation code. In
> this case you would have a second M/R job that runs after the first and
> randomly samples words/phrases to work on. It would then generate the
> increasingly long phrases and send them all to a single reducer that
> would buffer them up, and when the Reducer has no more input it would
> output every Nth key so that you get the proper number of partitions
> for the Reducers. You could sort these keys yourself to be sure, but
> they should come in sorted order, so why bother resorting.
>
> If my assumptions are totally wrong here please let me know.
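For reference: the "random sampling instead of reading the first N" you
describe is roughly what Hadoop's bundled InputSampler.RandomSampler does
when it builds a partition file for the TotalOrderPartitioner. A minimal
driver sketch (placeholder paths from args, Text keys and values, the
identity map/reduce, and a made-up class name) could look like this:

// A minimal sketch: wire TotalOrderPartitioner to a partition file
// produced by random sampling. Input is assumed to be SequenceFiles
// with Text keys and Text values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SortedOutputDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sorted-output");
    job.setJarByClass(SortedOutputDriver.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(16);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Route keys by range instead of by hash, so the concatenated
    // reducer outputs form one globally sorted sequence.
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path(args[2]));

    // Sample roughly 1% of the keys (at most 10000 samples from at
    // most 100 input splits) and write the 15 partition boundaries
    // to the partition file configured above.
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 100));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}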
I've had a discussion with some coworkers and we came to a possible
solution that is very closely related to your idea.

Because this is a job that runs periodically, we think we can assume the
distribution of the dataset will have a similar "shape" from one run to
the next. If this assumption holds we can:

1) Create a job that takes the output of run 1 and creates an aggregate
   that can be used to partition the dataset (a rough sketch of this
   step follows below).
2) Use the partitioning dataset from step 1 to distribute the processing
   for the next run.
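A rough sketch of what step 1 could look like, assuming the previous
run's output is a single SequenceFile with Text keys in sorted order
(the paths, the reducer count, and the class name are all placeholders;
a real job would scan the whole output directory, or sample rather than
read every key into memory):

// Step 1 sketched out: derive partition boundaries for the next run
// from the keys of the previous run's output.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PartitionFileFromPreviousRun {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    int numReducers = 16;

    // Read all keys from the previous run's output (assumed here to
    // be a single SequenceFile with Text keys, already sorted).
    List<Text> keys = new ArrayList<Text>();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      Text key = new Text();
      while (reader.next(key)) {
        keys.add(new Text(key));
      }
    } finally {
      reader.close();
    }

    // Pick numReducers - 1 evenly spaced boundary keys. Because the
    // input keys are sorted, these give partitions of roughly equal
    // size, as long as the next run's distribution keeps the same
    // "shape" as this one.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, NullWritable.class);
    try {
      for (int i = 1; i < numReducers; i++) {
        writer.append(keys.get(i * keys.size() / numReducers),
            NullWritable.get());
      }
    } finally {
      writer.close();
    }
  }
}

The result has exactly the shape the TotalOrderPartitioner expects
(numReducers - 1 sorted keys in a SequenceFile), so the next run can
point TotalOrderPartitioner.setPartitionFile() at it and skip the
sampling pass entirely.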
Thanks for your suggestions.

--
Best regards / Met vriendelijke groeten,

Niels Basjes