Home | About | Sematext search-lucene.com search-hadoop.com
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop >> mail # user >> Using global reverse lookup tables


Copy link to this message
-
Re: Using global reverse lookup tables
Depending on the function that you want to use, it sounds like you want to
use a self join to compute transposed cooccurrence.

That is, it sounds like you want to find all the sets that share elements
with X.  If you have a binary matrix A that represents your set membership
with one row per set and one column per element, then you want to compute A
A'.   It is common for A to be available in row-major form or in the form of
pairs.  With row major form, the easiest way to compute your desired result
is to transpose A in a first map-reduce.  With either the transposed matrix
a second map-reduce can be used in which the mapper reads all of the sets
with a particular element and emits pairs of sets that have a common
element.  The combiner and reducer are basically a pair counter.

This implementation suffers in that it takes time quadratic in the density
of the most dense row.  It is common to downsample such rows to a reasonable
level.  Most uses of the cooccurrence matrix don't care about this
downsampling and the makes the algorithm much faster.

As an alternative, you can compute a matrix decomposition and use that to
compute A A'.  This can be arranged so as to avoid the downsampling, but the
program required is much more complex.  The Apache Mahout project has
several implementations of such decompositions tuned for different
situations.  Some implementations use map-reduce, some are sequential.
On Mon, Apr 11, 2011 at 9:30 AM, W.P. McNeill <[EMAIL PROTECTED]> wrote:

> Are there general approaches to solving this problem?
NEW: Monitor These Apps!
elasticsearch, apache solr, apache hbase, hadoop, redis, casssandra, amazon cloudwatch, mysql, memcached, apache kafka, apache zookeeper, apache storm, ubuntu, centOS, red hat, debian, puppet labs, java, senseiDB