Re: Cross Join
Edmund Kohlwey 2009-11-10, 14:56
Thanks to all who commented on this. I think there was some confusion
over what I was trying to do: indeed there was no common key between the
two tables to join on, which made all the methods I investigated either
inappropriate or inefficient. In the end I decided to write my own join
class. It can be implemented in either a reducer or a mapper. While the
reducer implementation is a bit cleaner, the mapper implementation
(theoretically) distributes the processing better. For those who are
interested, the basic algorithm is:
Let "x" denote the cross product of two vectors.

Allow MapReduce to partition the left side of the input.
On each mapper:
    let left_i = all the left-side key/value pairs processed on that node
    in cleanup() (or at the end of the reduce):
        let right = the right side of the join, opened through HDFS
        for each pair of pairs in left_i x right:
            if transform(pair) != null: emit transform(pair)
end on each
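
A minimal sketch of the mapper-side variant follows, assuming the Hadoop
0.20 "mapreduce" API; the configuration key "right.side.path" and the
transform() predicate are illustrative placeholders, not the actual code:

// A sketch of the mapper-side cross join described above (Hadoop 0.20
// "mapreduce" API). The configuration key "right.side.path" and the
// transform() predicate are placeholders for illustration.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStreamReader;
import java.util.ArrayList;
import java.util.List;

import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class CrossJoinMapper
        extends Mapper<LongWritable, Text, Text, NullWritable> {

    // left_i: every left-side record routed to this mapper's input split
    private final List<String> leftRecords = new ArrayList<String>();

    @Override
    protected void map(LongWritable key, Text value, Context context) {
        // Buffer the left side; nothing is emitted until cleanup().
        leftRecords.add(value.toString());
    }

    @Override
    protected void cleanup(Context context)
            throws IOException, InterruptedException {
        // Open the right side of the join directly from HDFS on this node.
        Path rightPath = new Path(
                context.getConfiguration().get("right.side.path"));
        FileSystem fs = rightPath.getFileSystem(context.getConfiguration());
        BufferedReader right = new BufferedReader(
                new InputStreamReader(fs.open(rightPath)));
        try {
            String rightLine;
            while ((rightLine = right.readLine()) != null) {
                for (String leftLine : leftRecords) {
                    // transform() filters the cross product; emit matches only.
                    String joined = transform(leftLine, rightLine);
                    if (joined != null) {
                        context.write(new Text(joined), NullWritable.get());
                    }
                }
            }
        } finally {
            right.close();
        }
    }

    // Placeholder join predicate, e.g. a regex from the left side matched
    // against a string from the right side; returns null for non-matches.
    private String transform(String left, String right) {
        return right.matches(left) ? left + "\t" + right : null;
    }
}

Note that buffering left_i in memory assumes each input split's left-side
records fit on one node; if not, they could be spilled to local disk
before the cleanup pass. Re-reading the whole right side on every mapper
is the price paid for distributing the left side.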
On 11/5/09 1:15 PM, Ashutosh Chauhan wrote:
> Hi Edmund,
> If you can prepare your dataset in the way org.apache.hadoop.mapred.join
> requires, then it might be an efficient way to do joins in your case.
> IMHO, though, the requirements it places are pretty restrictive. Also,
> instead of reinventing the wheel, I would suggest you take a look at how
> Pig solves the "joining large datasets" problem. It has in fact four
> different join algorithms implemented, and one or more of them should
> satisfy your requirements. It seems to me that Pig's merge join is well
> suited to your case. Its only requirement is that the datasets be sorted
> on both sides. The datasets need not be equipartitioned, need not have
> the same number of partitions, etc. You said that sorting the dataset is
> a pain in your case. Pig's orderby is quite sophisticated and performs
> sorting quite efficiently. If sorting is indeed not an option, then you
> may want to consider Pig's hash join or skewed join.
> Joins in Pig are explained at a high level here:
> Hope it helps,
> On Thu, Nov 5, 2009 at 06:19, Jason Venner<[EMAIL PROTECTED]> wrote:
>> Look at the join package in MapReduce; it provides this functionality
>> cleanly, for ordered datasets that have the same partitioning.
>> org.apache.hadoop.mapred.join in Hadoop 0.19
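
(For reference, a minimal sketch of wiring up that package with the old
"mapred" API; the paths, input format, and map-only setup here are
assumptions for illustration:)

// Hedged sketch: configuring org.apache.hadoop.mapred.join on the old API.
// Both inputs must already be sorted and identically partitioned on the
// join key; the paths and input format below are assumed for illustration.
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapred.FileOutputFormat;
import org.apache.hadoop.mapred.JobClient;
import org.apache.hadoop.mapred.JobConf;
import org.apache.hadoop.mapred.KeyValueTextInputFormat;
import org.apache.hadoop.mapred.join.CompositeInputFormat;
import org.apache.hadoop.mapred.join.TupleWritable;

public class MapSideJoinDriver {
    public static void main(String[] args) throws Exception {
        JobConf conf = new JobConf(MapSideJoinDriver.class);
        conf.setJobName("map-side join sketch");
        conf.setInputFormat(CompositeInputFormat.class);
        // "inner" joins keys present in both inputs; the join expression
        // language also supports "outer" and "override".
        conf.set("mapred.join.expr", CompositeInputFormat.compose(
                "inner", KeyValueTextInputFormat.class,
                new Path(args[0]), new Path(args[1])));
        // Map-only job: the default identity mapper emits one TupleWritable
        // per joined key, holding the matching values from each input.
        conf.setNumReduceTasks(0);
        conf.setOutputKeyClass(Text.class);
        conf.setOutputValueClass(TupleWritable.class);
        FileOutputFormat.setOutputPath(conf, new Path(args[2]));
        JobClient.runJob(conf);
    }
}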
>> On Wed, Nov 4, 2009 at 6:52 AM, Edmund Kohlwey<[EMAIL PROTECTED]> wrote:
>>> I'm looking for an efficient way to do a cross join. I've gone through a
>>> few implementations, and I wanted to seek some advice before attempting
>>> another. The join is a "large collection to large collection" join - so
>>> there are no trick optimizations like downloading one side of the join
>>> on each node (i.e., a map-side join). The output of the join will be
>>> sparse (it's matching a large collection of regexes against a large
>>> collection of strings), but because of the nature of the data there's
>>> not really any way to pre-process either side of the join.
>>> 1. Naive approach - on a single node, iterate over both collections,
>>> resulting in reading the "left" file once and the right file n times -
>>> I know this is bad.
>>> 2. Indexed approach - index each data item with a row/col - requires
>>> replicating, sorting, and shuffling all the records 2 times - also not
>>> good. This actually seemed to perform worse than 1, and resulted in
>>> running out of disk space on the mappers when output was spilled to
>>> disk.
>>> I'm now considering what to try next. One idea is to improve on 1 by
>>> "blocking" the reads, so that the right side of the join is read b times,
>>> where b is the number of blocks the left side is split into.
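
(A minimal, Hadoop-independent sketch of that blocked read, i.e. a block
nested-loop join; the file arguments, block size, and regex-match
predicate are assumptions:)

// A Hadoop-independent sketch of the blocked variant: read the left file
// once, in blocks that fit in memory, and stream the right file once per
// block. File paths, block size, and the regex predicate are assumptions.
import java.io.BufferedReader;
import java.io.FileReader;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Pattern;

public class BlockedCrossJoin {
    private static final int BLOCK_SIZE = 100000; // left records per block

    public static void main(String[] args) throws IOException {
        String leftFile = args[0], rightFile = args[1];
        try (BufferedReader left =
                new BufferedReader(new FileReader(leftFile))) {
            List<Pattern> block = new ArrayList<Pattern>();
            String line;
            while ((line = left.readLine()) != null) {
                // Assume the left side holds the regexes; compile each once.
                block.add(Pattern.compile(line));
                if (block.size() == BLOCK_SIZE) {
                    joinBlock(block, rightFile);
                    block.clear();
                }
            }
            if (!block.isEmpty()) {
                joinBlock(block, rightFile); // final partial block
            }
        }
    }

    // One full pass over the right file per left block, so the right side
    // is read b times, where b = ceil(left records / BLOCK_SIZE), instead
    // of once per left record as in the naive approach.
    private static void joinBlock(List<Pattern> block, String rightFile)
            throws IOException {
        try (BufferedReader right =
                new BufferedReader(new FileReader(rightFile))) {
            String rightLine;
            while ((rightLine = right.readLine()) != null) {
                for (Pattern regex : block) {
                    if (regex.matcher(rightLine).matches()) {
                        System.out.println(regex.pattern() + "\t" + rightLine);
                    }
                }
            }
        }
    }
}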