MapReduce, mail # user - Re: Pairwise Comparison of Large Datasets


Re: Pairwise Comparison of Large Datasets
Vasco Visser 2013-01-03, 00:47
Hi Rob,

Thanks for sharing. The approach you take is similar to how Pig
implements the cross product (see the section on cross in
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html).
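
To make the grid idea concrete, here is a minimal sketch in plain Python
(not Pig or Hadoop code). It reflects my understanding of the general
technique, not Pig's actual implementation: each left record picks a random
row and is replicated across that row's columns, each right record picks a
random column and is replicated down that column's rows, and each cell then
pairs whatever it received. The name grid_cross and the n_rows, n_cols and
seed parameters are made up for illustration.

```python
import random
from collections import defaultdict
from itertools import product


def grid_cross(left, right, n_rows=2, n_cols=2, seed=0):
    """Cross product of two inputs via a grid of cells (illustrative only)."""
    rng = random.Random(seed)
    # Each cell holds the left and right records routed to it.
    cells = defaultdict(lambda: ([], []))

    # "Map" side: a left record picks one row and goes to every column in it.
    for rec in left:
        row = rng.randrange(n_rows)
        for col in range(n_cols):
            cells[(row, col)][0].append(rec)

    # A right record picks one column and goes to every row in it.
    for rec in right:
        col = rng.randrange(n_cols)
        for row in range(n_rows):
            cells[(row, col)][1].append(rec)

    # "Reduce" side: each cell only pairs the records it received.
    pairs = []
    for lefts, rights in cells.values():
        pairs.extend(product(lefts, rights))
    return pairs


if __name__ == "__main__":
    result = grid_cross([1, 2, 3], ["a", "b"], n_rows=2, n_cols=3)
    print(len(result))
```

A left record in row r and a right record in column c meet only in cell
(r, c), so every pair is produced exactly once and no cell needs to see
the whole dataset.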

You'll probably find this article interesting:
Processing Theta-Joins using MapReduce
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.229.1890&rep=rep1&type=pdf)
It features a similar grid-like approach, but with some smart tricks.

You'll probably also like Jimmy Lin's articles on pairwise similarity in
MR (http://www.umiacs.umd.edu/~jimmylin/publications/index.html).

best, Vasco

On Mon, Dec 31, 2012 at 7:42 PM, Rob Styles <[EMAIL PROTECTED]> wrote:
> Happy New Year :)
>
> Thought some of you might find this useful.
>
> We've developed an approach to doing pairwise comparisons on large datasets
> that doesn't require visibility of the whole dataset at any time. The
> approach brings together pairs for comparison using incrementing coordinates
> to target a value at a cell.
>
> http://dynamicorange.com/2012/12/31/pairwise-comparisons-of-large-datasets/
>
> There is still work to do on making the approach more efficient and trying
> to eliminate a pre-processing step. Help gratefully received.
>
> If there's a toolset already out there for doing this I'd be happy to hear
> about that too!
>
> thanks
>
> rob