

Re: Using global reverse lookup tablesTed Dunning 20110411, 17:03
Depending on the function that you want to use, it sounds like you want to
use a self join to compute transposed cooccurrence. That is, it sounds like you want to find all the sets that share elements with X. If you have a binary matrix A that represents your set membership with one row per set and one column per element, then you want to compute A A'. It is common for A to be available in rowmajor form or in the form of pairs. With row major form, the easiest way to compute your desired result is to transpose A in a first mapreduce. With either the transposed matrix a second mapreduce can be used in which the mapper reads all of the sets with a particular element and emits pairs of sets that have a common element. The combiner and reducer are basically a pair counter. This implementation suffers in that it takes time quadratic in the density of the most dense row. It is common to downsample such rows to a reasonable level. Most uses of the cooccurrence matrix don't care about this downsampling and the makes the algorithm much faster. As an alternative, you can compute a matrix decomposition and use that to compute A A'. This can be arranged so as to avoid the downsampling, but the program required is much more complex. The Apache Mahout project has several implementations of such decompositions tuned for different situations. Some implementations use mapreduce, some are sequential. On Mon, Apr 11, 2011 at 9:30 AM, W.P. McNeill <[EMAIL PROTECTED]> wrote: > Are there general approaches to solving this problem? 