Re: Pairwise Comparison of Large Datasets
Vasco Visser 2013-01-03, 00:47
Thanks for sharing. The approach you take is similar to how Pig
implements the cross product (see the cross section in:
What you'll probably find interesting is this article:
Processing Theta-Joins using MapReduce
It features a similar grid-like approach, but with some smart tricks.
You'll probably also like Jimmy Lin's articles on pairwise similarity in
On Mon, Dec 31, 2012 at 7:42 PM, Rob Styles <[EMAIL PROTECTED]> wrote:
> Happy New Year :)
> Thought some of you might find this useful.
> We've developed an approach to doing pairwise comparisons on large datasets
> that doesn't require visibility of the whole dataset at any time. The
> approach brings together pairs for comparison using incrementing coordinates
> to target a value at a cell.
> There is still work to do on making the approach more efficient and trying
> to eliminate a pre-processing step. Help gratefully received.
> If there's a toolset already out there for doing this I'd be happy to hear
> about that too!
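
For anyone following the thread who wants to see the grid idea in code, here is a rough single-machine sketch in Python. It is only an illustration of the general technique and is not Rob's implementation, Pig's CROSS, or the algorithm from the theta-join article. The names (grid_pairs, blocks) are made up for this example, and assigning a record to a block by its position modulo the block count is an assumption. Each record is sent to every cell in its row or column of an upper-triangular grid, and each cell then compares only the records it received, so no step needs to see the whole dataset.

```python
from collections import defaultdict
from itertools import combinations, product


def grid_pairs(records, blocks):
    """Yield every unordered pair of records exactly once.

    Hypothetical sketch: records are split into `blocks` groups and
    pairs are formed per grid cell (row, col) with row <= col.
    """
    # "Map" step: route each record to the cells that need it.
    # Each cell holds two lists: the row side and the column side.
    cells = defaultdict(lambda: ([], []))
    for position, record in enumerate(records):
        block = position % blocks  # assumption: block chosen by position
        for other in range(blocks):
            if block <= other:
                # Covers the diagonal cell too (block == other).
                cells[(block, other)][0].append(record)
            else:
                cells[(other, block)][1].append(record)

    # "Reduce" step: each cell only looks at its own records.
    for (row, col), (row_side, col_side) in cells.items():
        if row == col:
            # Diagonal cell: pairs within a single block.
            yield from combinations(row_side, 2)
        else:
            # Off-diagonal cell: every row record against every column record.
            yield from product(row_side, col_side)


if __name__ == "__main__":
    for left, right in grid_pairs(["a", "b", "c", "d", "e"], blocks=2):
        print(left, right)
```

In a MapReduce setting the cell coordinates would be the key emitted by the mapper, and each reducer would handle one cell. The cost is that every record is copied once per block, which is the replication overhead that grid schemes generally try to reduce.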