W.P. McNeill 2011-04-11, 16:30
Depending on the function that you want to use, it sounds like you want to
use a self join to compute transposed cooccurrence.
That is, it sounds like you want to find all the sets that share elements
with X. If you have a binary matrix A that represents your set membership
with one row per set and one column per element, then you want to compute A
A'. It is common for A to be available in row-major form or in the form of
pairs. With row major form, the easiest way to compute your desired result
is to transpose A in a first map-reduce. With either the transposed matrix
a second map-reduce can be used in which the mapper reads all of the sets
with a particular element and emits pairs of sets that have a common
element. The combiner and reducer are basically a pair counter.
This implementation suffers in that it takes time quadratic in the density
of the most dense row. It is common to downsample such rows to a reasonable
level. Most uses of the cooccurrence matrix don't care about this
downsampling and the makes the algorithm much faster.
As an alternative, you can compute a matrix decomposition and use that to
compute A A'. This can be arranged so as to avoid the downsampling, but the
program required is much more complex. The Apache Mahout project has
several implementations of such decompositions tuned for different
situations. Some implementations use map-reduce, some are sequential.
On Mon, Apr 11, 2011 at 9:30 AM, W.P. McNeill <[EMAIL PROTECTED]> wrote:
> Are there general approaches to solving this problem?
W.P. McNeill 2011-04-15, 18:45
Ted Dunning 2011-04-15, 21:43