I am writing to ask for ideas on how to do a Cartesian product in Hadoop.
Specifically, I have two datasets, each of which contains 20 million lines.
I want to compute the Cartesian product of these two datasets, comparing
each line of one dataset with each line of the other.
Most of the comparison results can be filtered out by a function (we do not
output the whole result of the Cartesian product, only a small part).
I guess one good way is to pass one block from dataset1 and another block
from dataset2 to a mapper, then let the mappers compute the product in
memory to avoid I/O.
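To make the idea concrete, here is a rough sketch in plain Python (not Hadoop code) of the block-pair approach I have in mind. The names blocks, keep_pair, map_block_pair and blocked_cartesian, the block size, and the equality filter are all made up for illustration; the real filter function would go where keep_pair is.

```python
from itertools import islice, product


def blocks(lines, block_size):
    """Yield consecutive lists of at most block_size lines."""
    it = iter(lines)
    while True:
        block = list(islice(it, block_size))
        if not block:
            return
        yield block


def keep_pair(a, b):
    # Hypothetical filter: stands in for the real comparison function.
    return a == b


def map_block_pair(block1, block2):
    """Work of one mapper: in-memory product of two blocks, then filter."""
    for a, b in product(block1, block2):
        if keep_pair(a, b):
            yield (a, b)


def blocked_cartesian(dataset1, dataset2, block_size):
    """Pair every block of dataset1 with every block of dataset2.

    dataset2 must be re-iterable (a list here), because it is scanned
    once per block of dataset1.
    """
    for b1 in blocks(dataset1, block_size):
        for b2 in blocks(dataset2, block_size):
            yield from map_block_pair(b1, b2)


if __name__ == "__main__":
    d1 = ["a", "b", "c", "d", "e"]
    d2 = ["c", "x", "a", "e"]
    for pair in blocked_cartesian(d1, d2, block_size=2):
        print(pair)
```

Each (block1, block2) pair is independent of the others, so in principle each one could be handed to a separate mapper. As an example with assumed numbers: with blocks of 100,000 lines, each 20 million line dataset splits into 200 blocks, which gives 200 x 200 = 40,000 block pairs.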
Thank you very much.