Re: Merge sorting reduce output files
Niels Basjes 2012-03-01, 14:23
Hi,

On Thu, Mar 1, 2012 at 00:07, Robert Evans <[EMAIL PROTECTED]> wrote:

>  Sorry it has taken me so long to respond.  Today has been a very crazy
> day.
>

No worries.
> I am just guessing what your algorithm is for auto-complete.
>

What we have has a lot more features, but the basic idea is similar enough
to what you describe for this discussion.
>  If we want the keys to come out in sorted order, we need to have a
> sequence file with the partition keys for the total order partitioner.
> TeraSort generates a partition file by getting ....
> This only really works for TeraSort because it assumes that all of the
> partitions are more or less random already.
>

And that is something I don't have.
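
(For reference, the standard way to produce that partition sequence file in the
Hadoop library is InputSampler.writePartitionFile() together with
TotalOrderPartitioner. Below is a minimal, illustrative sketch only; the
SequenceFile input, the Text keys, the reducer count and the sampling
parameters are assumptions for the example, not anything from this thread.)

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSortSketch {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = new Job(conf, "total-order sort (sketch)");
    job.setJarByClass(TotalOrderSortSketch.class);

    // Assumption: the input is a SequenceFile whose keys are already the Text
    // keys we want to partition on -- the TeraSort-style precondition, since
    // the sampler looks at *input* keys, not at map output keys.
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(16);                       // example value
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Point the partitioner at the partition file we are about to write.
    Path partitionFile = new Path(args[1] + "_partitions");
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(), partitionFile);
    job.setPartitionerClass(TotalOrderPartitioner.class);

    // Sample roughly 0.1% of the records (at most 10000 keys from at most 10
    // splits) and write numReduceTasks - 1 split points to the partition file.
    InputSampler.writePartitionFile(job,
        new InputSampler.RandomSampler<Text, Text>(0.001, 10000, 10));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}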
> This is the case for the output of a typical map/reduce job where the
> reduce does not change the keys passed in and the output of the reducer is
> less than a block in size.  That sure sounds like what wordcount does to
> me.  The only real way to get around that is to do it as part of a
> map/reduce job, and do some random sampling instead of reading the first N.
> It should be a map/reduce job because it is going to be reading a lot more
> data than TeraSort’s partition generation code.  In this case you would
> have a second M/R job that runs after the first and randomly samples
> words/phrases to work on.  It would then generate the increasingly long
> phrases and send them all to a single reducer that would buffer them up,
> and when the Reducer has no more input it would output every Nth key so
> that you get the proper number of partitions for the Reducers.  You could
> sort these keys yourself to be sure, but they should come in in sorted
> order, so why bother re-sorting.
>
> If my assumptions are totally wrong here please let me know.
>
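
A rough sketch of what that single sampling reducer could look like (my reading
of the proposal above, not code from this thread; the 16-partition count and
the NullWritable values are assumptions for illustration):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class PartitionKeySamplingReducer
    extends Reducer<Text, NullWritable, Text, NullWritable> {

  // Assumption: equals the number of reducers planned for the real sort job.
  private static final int NUM_PARTITIONS = 16;

  private final List<Text> buffered = new ArrayList<Text>();

  @Override
  protected void reduce(Text key, Iterable<NullWritable> values, Context context)
      throws IOException, InterruptedException {
    // The framework hands the sampled keys to this single reducer in sorted
    // order; copy each one because Hadoop reuses the key object.
    buffered.add(new Text(key));
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    // Once the input is exhausted, emit every Nth buffered key to produce
    // NUM_PARTITIONS - 1 split points (fewer if the sample is very small).
    int step = Math.max(1, buffered.size() / NUM_PARTITIONS);
    int emitted = 0;
    for (int i = step; i < buffered.size() && emitted < NUM_PARTITIONS - 1; i += step) {
      context.write(buffered.get(i), NullWritable.get());
      emitted++;
    }
  }
}

Run with a single reduce task and a SequenceFile output, the result could serve
directly as the partition file for TotalOrderPartitioner.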

I had a discussion with some coworkers and we came up with a possible
solution that is closely related to your idea.
Because this is a job that runs periodically, we think we can assume the
distribution of the dataset will have a similar "shape" from one run to the
next.
If this assumption holds we can:
1) Create a job that takes the output of run 1 and creates an aggregate that
can be used to partition the dataset.
2) Use the partitioning dataset from '1)' to distribute the processing for
the next run (sketched below).
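
A minimal sketch of how step 2 might be wired up (illustrative only; the helper
method and the idea of storing the split points as a TotalOrderPartitioner
partition file are my assumptions, not decisions made in this thread):

import org.apache.hadoop.fs.Path;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class NextRunDriver {

  // Hypothetical helper: reuse the split points derived from the previous
  // run's output instead of sampling the current input at submit time.
  static void usePreviousRunPartitions(Job job, Path partitionFileFromRun1) {
    TotalOrderPartitioner.setPartitionFile(job.getConfiguration(),
        partitionFileFromRun1);
    job.setPartitionerClass(TotalOrderPartitioner.class);
  }
}

As long as the key distribution keeps roughly the same shape between runs, the
resulting partitions should stay reasonably balanced.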

Thanks for your suggestions.

--
Best regards / Met vriendelijke groeten,

Niels Basjes