Hi,

On Thu, Mar 1, 2012 at 00:07, Robert Evans <[email protected]> wrote:

> Sorry it has taken me so long to respond. Today has been a very crazy
> day.

No worries.

> I am just guessing what your algorithm is for auto-complete.

What we have has a lot more features, yet the basic idea is similar
enough to what you describe for this discussion.

> If we want the keys to come out in sorted order, we need to have a
> sequence file with the partition keys for the total order partitioner.
> TeraSort generates a partition file by getting ....
> This only really works for TeraSort because it assumes that all of the
> partitions are more or less random already.

And that is something I don't have.

> This is the case for the output of a typical map/reduce job where the
> reduce does not change the keys passed in and the output of the reducer
> is less than a block in size. That sure sounds like what wordcount does
> to me. The only real way to get around that is to do it as part of a
> map/reduce job, and do some random sampling instead of reading the
> first N. It should be a map/reduce job because it is going to be
> reading a lot more data than TeraSort's partition generation code. In
> this case you would have a second M/R job that runs after the first and
> randomly samples words/phrases to work on. It would then generate the
> increasingly long phrases and send them all to a single reducer that
> would buffer them up, and when the Reducer has no more input it would
> output every Nth key so that you get the proper number of partitions
> for the Reducers. You could sort these keys yourself to be sure, but
> they should come in sorted order, so why bother resorting.
>
> If my assumptions are totally wrong here please let me know.
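For reference: the "random sampling instead of reading the first N" you
describe is roughly what Hadoop's bundled InputSampler.RandomSampler does
when it builds a partition file for the TotalOrderPartitioner. A minimal
driver sketch (placeholder paths from args, Text keys and values, the
identity map/reduce, and a made-up class name) could look like this:

// A minimal sketch: wire TotalOrderPartitioner to a partition file
// produced by random sampling. Input is assumed to be SequenceFiles
// with Text keys and Text values.
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class SortedOutputDriver {
  public static void main(String[] args) throws Exception {
    Job job = Job.getInstance(new Configuration(), "sorted-output");
    job.setJarByClass(SortedOutputDriver.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(16);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Route keys by range instead of by hash, so the concatenated
    // reducer outputs form one globally sorted sequence.
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        new Path(args[2]));

    // Sample roughly 1% of the keys (at most 10000 samples from at
    // most 100 input splits) and write the 15 partition boundaries
    // to the partition file configured above.
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 100));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}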
I've had a discussion with some coworkers and we came to a possible
solution that is very closely related to your idea.

Because this is a job that runs periodically, we think we can assume the
distribution of the dataset will have a similar "shape" from one run to
the next. If this assumption holds we can:

1) Create a job that takes the output of run 1 and creates an aggregate
   that can be used to partition the dataset (a rough sketch of this
   step follows below).
2) Use the partitioning dataset from step 1 to distribute the processing
   for the next run.
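A rough sketch of what step 1 could look like, assuming the previous
run's output is a single SequenceFile with Text keys in sorted order
(the paths, the reducer count, and the class name are all placeholders;
a real job would scan the whole output directory, or sample rather than
read every key into memory):

// Step 1 sketched out: derive partition boundaries for the next run
// from the keys of the previous run's output.
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.io.Text;

public class PartitionFileFromPreviousRun {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    FileSystem fs = FileSystem.get(conf);
    int numReducers = 16;

    // Read all keys from the previous run's output (assumed here to
    // be a single SequenceFile with Text keys, already sorted).
    List<Text> keys = new ArrayList<Text>();
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      Text key = new Text();
      while (reader.next(key)) {
        keys.add(new Text(key));
      }
    } finally {
      reader.close();
    }

    // Pick numReducers - 1 evenly spaced boundary keys. Because the
    // input keys are sorted, these give partitions of roughly equal
    // size, as long as the next run's distribution keeps the same
    // "shape" as this one.
    SequenceFile.Writer writer = SequenceFile.createWriter(
        fs, conf, new Path(args[1]), Text.class, NullWritable.class);
    try {
      for (int i = 1; i < numReducers; i++) {
        writer.append(keys.get(i * keys.size() / numReducers),
            NullWritable.get());
      }
    } finally {
      writer.close();
    }
  }
}

The result has exactly the shape the TotalOrderPartitioner expects
(numReducers - 1 sorted keys in a SequenceFile), so the next run can
point TotalOrderPartitioner.setPartitionFile() at it and skip the
sampling pass entirely.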
Thanks for your suggestions.

--
Best regards / Met vriendelijke groeten,

Niels Basjes