It is rarely practical to do exhaustive comparisons on datasets of this
The usual method is to heuristically prune the Cartesian product and
only examine pairs that have a high likelihood of being near each other.
This can be done in many ways. Your suggestion of doing a map-side join is
a reasonable one, but it will be much slower than methods where you can
prune the comparisons.
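
To make the pruning idea concrete, here is a minimal sketch in Python. It is
an illustration only, under assumptions that are not in the original question:
records are plain numbers, "near" means an absolute difference of at most
eps, and the name near_pairs is hypothetical. For one-dimensional values this
grid bucketing happens to be exact; for text or high-dimensional records you
would substitute an approximate scheme such as locality-sensitive hashing,
which is where the heuristic part comes in.

```python
from collections import defaultdict
from math import floor


def near_pairs(left, right, eps):
    """Return pairs (a, b) with |a - b| <= eps without scanning the full product.

    Assumption for this sketch: records are numbers and eps > 0.
    """
    # Bucket the right-hand dataset by grid cell of width eps.
    buckets = defaultdict(list)
    for b in right:
        buckets[floor(b / eps)].append(b)

    pairs = []
    for a in left:
        cell = floor(a / eps)
        # Any b within eps of a must fall in the same cell or an adjacent one,
        # so every other bucket can be skipped entirely.
        for neighbour in (cell - 1, cell, cell + 1):
            for b in buckets.get(neighbour, ()):
                if abs(a - b) <= eps:
                    pairs.append((a, b))
    return pairs
```

In a MapReduce job the cell number would play the role of the shuffle key:
one dataset is emitted under its own cell, the other under its cell and the
two adjacent cells, and each reducer compares only the records that share
a key.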
On Thu, Apr 18, 2013 at 9:47 AM, zheyi rong <[EMAIL PROTECTED]> wrote:
> Dear all,
> I am writing to ask for ideas on computing a Cartesian product in Hadoop.
> Specifically, I have two datasets, each of which contains 20 million
> I want to compute the Cartesian product of these two datasets, comparing lines
> The output of each comparison can be mostly filtered by a function (we do
> not output the whole result of this cartesian product, but only a small
> part).
> I guess one good way is to pass one block from dataset1 and another block
> from dataset2 to a mapper, then let the mappers do the product in memory
> to avoid IO.
> Any suggestions?
> Thank you very much.
> Zheyi Rong