-Re: Merge sorting reduce output files
Niels Basjes 2012-03-01, 14:23
On Thu, Mar 1, 2012 at 00:07, Robert Evans <[EMAIL PROTECTED]> wrote:
> Sorry it has taken me so long to respond. Today has been a very crazy
> I am just guessing what your algorithm is for auto-complete.
What we have has a lot more features. Yet the basic idea of what we have is
similar enough to what you describe for this discussion.
> If we want the keys to come out in sorted order, we need to have a
> sequence file with the partition keys for the total order partitioner.
> TeraSort generates a partition file by getting ....
> This only really works for Terasort because it assumes that all of the
> partitions are more or less random already.
And that is something I don't have.
> This is the case for the output of a typical map/reduce job where the
> reduce does not change the keys passed in and the output of the reducer is
> less then a block in size. That sure sounds like what wordcount does to
> me. The only real way to get around that is to do it as part of a
> map/reduce job, and do some random sampling instead of reading the first N.
> It should be a map/reduce job because it is going to be reading a lot more
> data then TeraSort’s partition generation code. In this case you would
> have a second M/R job that runs after the first and randomly samples
> words/phrases to work on. It would then generate the increasing long
> phrases and send them all to a single reducer that would buffer them up,
> and when the Reducer has no more input it would output every Nth key so
> that you get the proper number of partitions for the Reducers. You could
> sort these keys yourself to be sure, but they should come in in sorted
> order so why bother resorting.
> If my assumptions are totally wrong here please let me know.
I've had a discussion with some coworkers and we came to a possible
solution that is very closely related to your idea.
Because this is a job that runs periodically we think we can assume the
distribution of the dataset will have a similar "shape" from one run to the
If this assumption holds we can:
1) Create a job that takes the output of run 1 and create a aggregate that
can be used to partition the dataset
2) Use the partitioning dataset from '1)' to distribute the processing for
the next run.
Thanks for your suggestions.
Best regards / Met vriendelijke groeten,