MapReduce >> mail # user >> Re: Pairwise Comparison of Large Datasets


Re: Pairwise Comparison of Large Datasets
Hi Rob,

Thanks for sharing. The approach you take is similar to how Pig
implements the cross product (see the section on cross in
http://ofps.oreilly.com/titles/9781449302641/advanced_pig_latin.html).

You'll probably find this article interesting:
"Processing Theta-Joins using MapReduce"
(http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.229.1890&rep=rep1&type=pdf),
which features a similar grid-like approach, but with some smart tricks.
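
To make the grid idea concrete, here is a rough single-process sketch in
Python. It is not the algorithm from the paper, nor Pig's implementation,
nor Rob's code; it only illustrates the general pattern of replicating
records to grid cells so that each cell can compare its pairs without
seeing the whole dataset. The function names, the integer keys, and the
`key % grid_size` banding are all assumptions made for the example.

```python
from collections import defaultdict
from itertools import combinations, product


def map_to_cells(records, grid_size):
    """'Map' step: replicate each (key, value) record to the cells that need it.

    Only cells (r, c) with r <= c are used, so each unordered pair of
    records meets in exactly one cell.
    """
    cells = defaultdict(lambda: {"row": [], "col": []})
    for key, value in records:
        # Assumption: integer keys. Use a stable hash for other key types.
        band = key % grid_size
        # The record is a "row" member of every cell in its row, from the
        # diagonal cell rightwards.
        for c in range(band, grid_size):
            cells[(band, c)]["row"].append((key, value))
        # It is a "column" member of every cell above the diagonal in its column.
        for r in range(band):
            cells[(r, band)]["col"].append((key, value))
    return cells


def reduce_cell(cell, sides):
    """'Reduce' step: produce the pairs that meet in one cell."""
    r, c = cell
    if r == c:
        # Diagonal cell: pair up records from the same band, once each.
        return list(combinations(sides["row"], 2))
    # Off-diagonal cell: every row-band record against every column-band record.
    return list(product(sides["row"], sides["col"]))


def pairwise(records, grid_size):
    """Yield every unordered pair of records exactly once."""
    for cell, sides in map_to_cells(records, grid_size).items():
        yield from reduce_cell(cell, sides)
```

With 10 records and a 3x3 grid, `pairwise` yields all 45 unordered pairs,
and no single cell ever holds more than two of the three bands. The price
is that each record is copied to `grid_size` cells.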

You'll probably also like Jimmy Lin's articles on pairwise similarity in
MR (http://www.umiacs.umd.edu/~jimmylin/publications/index.html).

best, Vasco

On Mon, Dec 31, 2012 at 7:42 PM, Rob Styles <[EMAIL PROTECTED]> wrote:
> Happy New Year :)
>
> Thought some of you might find this useful.
>
> We've developed an approach to doing pairwise comparisons on large datasets
> that doesn't require visibility of the whole dataset at any time. The
> approach brings together pairs for comparison using incrementing coordinates
> to target a value at a cell.
>
> http://dynamicorange.com/2012/12/31/pairwise-comparisons-of-large-datasets/
>
> There is still work to do on making the approach more efficient and trying
> to eliminate a pre-processing step. Help gratefully received.
>
> If there's a toolset already out there for doing this I'd be happy to hear
> about that too!
>
> thanks
>
> rob