Re: Merge sorting reduce output files
Hi,

On Thu, Mar 1, 2012 at 00:07, Robert Evans <[EMAIL PROTECTED]> wrote:

>  Sorry it has taken me so long to respond.  Today has been a very crazy
> day.
>

No worries.
> I am just guessing what your algorithm is for auto-complete.
>

What we have includes a lot more features, but the basic idea is
similar enough to what you describe for the purposes of this discussion.
>  If we want the keys to come out in sorted order, we need to have a
> sequence file with the partition keys for the total order partitioner.
> TeraSort generates a partition file by getting ....
> This only really works for TeraSort because it assumes that all of the
> partitions are more or less random already.
>

And that is something I don't have.
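
For reference, the TotalOrderPartitioner wiring you describe, using the
newer mapreduce API, looks roughly like this (paths, reducer count, and
key/value types are placeholders, and I haven't tested this exact
snippet):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.input.SequenceFileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.InputSampler;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class TotalOrderSortJob {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();
    Job job = Job.getInstance(conf, "total-order-sort");
    job.setJarByClass(TotalOrderSortJob.class);
    job.setInputFormatClass(SequenceFileInputFormat.class);
    job.setOutputKeyClass(Text.class);
    job.setOutputValueClass(Text.class);
    job.setNumReduceTasks(16);
    FileInputFormat.addInputPath(job, new Path(args[0]));
    FileOutputFormat.setOutputPath(job, new Path(args[1]));

    // Globally sorted output needs a partitioner that sends disjoint,
    // ordered key ranges to each reducer.
    job.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(
        job.getConfiguration(), new Path(args[1] + "_partitions"));

    // Random sampling of the job input, as in TeraSort; this is the
    // step that assumes the input is already more or less random.
    InputSampler.writePartitionFile(
        job, new InputSampler.RandomSampler<Text, Text>(0.01, 10000, 10));

    System.exit(job.waitForCompletion(true) ? 0 : 1);
  }
}
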
> This is the case for the output of a typical map/reduce job where the
> reduce does not change the keys passed in and the output of the reducer
> is less than a block in size.  That sure sounds like what wordcount does
> to me.  The only real way to get around that is to do it as part of a
> map/reduce job, and do some random sampling instead of reading the first
> N.  It should be a map/reduce job because it is going to be reading a
> lot more data than TeraSort’s partition generation code.  In this case
> you would have a second M/R job that runs after the first and randomly
> samples words/phrases to work on.  It would then generate the
> increasingly long phrases and send them all to a single reducer that
> would buffer them up, and when the reducer has no more input it would
> output every Nth key so that you get the proper number of partitions for
> the reducers.  You could sort these keys yourself to be sure, but they
> should already arrive in sorted order, so why bother re-sorting.
>
> If my assumptions are totally wrong here please let me know.
>
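
In code, the single-reducer step you describe might look roughly like
this (a rough, untested sketch; "target.reducers" is a made-up
configuration knob):

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Reducer;

public class SplitPointReducer
    extends Reducer<Text, NullWritable, Text, NullWritable> {

  private final List<String> samples = new ArrayList<String>();

  @Override
  protected void reduce(Text key, Iterable<NullWritable> values,
      Context context) {
    // Keys arrive in sorted order, so the buffer stays sorted too.
    samples.add(key.toString());
  }

  @Override
  protected void cleanup(Context context)
      throws IOException, InterruptedException {
    int reducers = context.getConfiguration().getInt("target.reducers", 16);
    // R reducers need R - 1 split points: emit every Nth sample.
    int step = Math.max(1, samples.size() / reducers);
    int emitted = 0;
    for (int i = step;
        i < samples.size() && emitted < reducers - 1;
        i += step) {
      context.write(new Text(samples.get(i)), NullWritable.get());
      emitted++;
    }
  }
}

The output of this reducer should be written as a SequenceFile of keys
(e.g. via SequenceFileOutputFormat) so that TotalOrderPartitioner can
read it directly as a partition file.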

I've discussed this with some coworkers and we arrived at a possible
solution that is closely related to your idea.
Because this is a job that runs periodically, we think we can assume the
distribution of the dataset will have a similar "shape" from one run to
the next.
If this assumption holds we can:
1) Create a job that takes the output of run 1 and creates an aggregate
that can be used to partition the dataset.
2) Use the partitioning dataset from step 1 to distribute the processing
for the next run (a rough sketch follows below).
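
In driver form, and with made-up paths under /runs/, the plan might look
something like this (again an untested sketch; SplitPointReducer is the
hypothetical reducer from the sketch above):

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Job;
import org.apache.hadoop.mapreduce.lib.input.FileInputFormat;
import org.apache.hadoop.mapreduce.lib.output.FileOutputFormat;
import org.apache.hadoop.mapreduce.lib.output.SequenceFileOutputFormat;
import org.apache.hadoop.mapreduce.lib.partition.TotalOrderPartitioner;

public class PeriodicRunDriver {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration();

    // Step 1: boil run 1's output down to split points. The output must
    // be a SequenceFile of keys so TotalOrderPartitioner can read it.
    Job buildPartitions = Job.getInstance(conf, "build-partitions");
    buildPartitions.setJarByClass(PeriodicRunDriver.class);
    // (mapper that samples/expands the phrases omitted)
    buildPartitions.setReducerClass(SplitPointReducer.class);
    buildPartitions.setNumReduceTasks(1);
    buildPartitions.setOutputFormatClass(SequenceFileOutputFormat.class);
    buildPartitions.setOutputKeyClass(Text.class);
    buildPartitions.setOutputValueClass(NullWritable.class);
    FileInputFormat.addInputPath(
        buildPartitions, new Path("/runs/run1/output"));
    FileOutputFormat.setOutputPath(
        buildPartitions, new Path("/runs/partitions"));
    if (!buildPartitions.waitForCompletion(true)) {
      System.exit(1);
    }

    // Step 2: the next periodic run reuses last run's boundaries, on
    // the assumption that the data keeps a similar "shape" between runs.
    Job nextRun = Job.getInstance(conf, "autocomplete-next-run");
    nextRun.setJarByClass(PeriodicRunDriver.class);
    nextRun.setPartitionerClass(TotalOrderPartitioner.class);
    TotalOrderPartitioner.setPartitionFile(
        nextRun.getConfiguration(),
        new Path("/runs/partitions/part-r-00000"));
    nextRun.setNumReduceTasks(16); // one more than the 15 split points
    // (mapper/reducer and input/output setup for the real job omitted)
    System.exit(nextRun.waitForCompletion(true) ? 0 : 1);
  }
}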

Thanks for your suggestions.

--
Best regards / Met vriendelijke groeten,

Niels Basjes