MapReduce >> mail # user >> Re: Pairwise Comparison of Large Datasets

Re: Pairwise Comparison of Large Datasets
Hi Rob,

Thanks for sharing. The approach you take is similar to how Pig
implements the cross product (see the cross section in:

What you'll probably find interesting is this article:
"Processing Theta-Joins using MapReduce",
which features a similar grid-like approach, but with some smart tricks.
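
For illustration only, here is a minimal single-process sketch of the general grid idea for all-pairs comparison. It is not Pig's implementation, not the algorithm from the article, and not Rob's method; the names (block_of, cells_for_block, all_pairs) and the choice of CRC32 for bucketing are assumptions made for the example. Each record is hashed into one of B blocks and sent to every grid cell (i, j) with i <= j that involves its block, so each cell only ever sees two blocks and never the whole dataset.

```python
import zlib
from collections import defaultdict
from itertools import combinations, product


def block_of(record_id, num_blocks):
    # Deterministic bucketing (hypothetical choice: CRC32 of the id).
    return zlib.crc32(str(record_id).encode("utf-8")) % num_blocks


def cells_for_block(block, num_blocks):
    # "Map" side: a record in `block` goes to every cell of the upper
    # triangle (i <= j) whose row or column is that block.
    return [(min(block, j), max(block, j)) for j in range(num_blocks)]


def all_pairs(record_ids, num_blocks=3):
    # Simulated shuffle: group (block, record) tuples by target cell.
    cells = defaultdict(list)
    for rid in record_ids:
        b = block_of(rid, num_blocks)
        for cell in cells_for_block(b, num_blocks):
            cells[cell].append((b, rid))

    # "Reduce" side: each cell pairs up only the records it received.
    pairs = []
    for (i, j), members in cells.items():
        if i == j:
            # Diagonal cell: pairs within a single block.
            pairs.extend(combinations([r for _, r in members], 2))
        else:
            # Off-diagonal cell: block i against block j.
            left = [r for blk, r in members if blk == i]
            right = [r for blk, r in members if blk == j]
            pairs.extend(product(left, right))
    return pairs
```

With unique record ids, every unordered pair meets in exactly one cell, at the cost of replicating each record B times. In a real MapReduce job the cell would be the reduce key; the loop over cells here only stands in for the reducers.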

You'll probably also like Jimmy Lin's articles on pairwise similarity in
MR (http://www.umiacs.umd.edu/~jimmylin/publications/index.html).

best, Vasco

On Mon, Dec 31, 2012 at 7:42 PM, Rob Styles <[EMAIL PROTECTED]> wrote:
> Happy New Year :)
> Thought some of you might find this useful.
> We've developed an approach to doing pairwise comparisons on large datasets
> that doesn't require visibility of the whole dataset at any time. The
> approach brings together pairs for comparison using incrementing coordinates
> to target a value at a cell.
> http://dynamicorange.com/2012/12/31/pairwise-comparisons-of-large-datasets/
> There is still work to do on making the approach more efficient and trying
> to eliminate a pre-processing step. Help gratefully received.
> If there's a toolset already out there for doing this I'd be happy to hear
> about that too!
> thanks
> rob