Home | About | Sematext search-lucene.com search-hadoop.com
 Search Hadoop and all its subprojects:

Switch to Threaded View
Hadoop, mail # user - Cross Join


Copy link to this message
-
Re: Cross Join
Edmund Kohlwey 2009-11-10, 14:56
Thanks to all who commented on this. I think there was some confusion
over what I was trying to do: indeed there was no common key between the
two tables to join on, which made all the methods I investigated either
inappropriate or inefficient. In the end I decided to write my own join
class. It can be written in a reducer or a mapper. While the reducer
implementation is a bit cleaner, the mapper implementation provides
(theoretically) better distributed processing. For those who are
interested, the basic algorithm is:

x is defined as the cross product of two vectors

proc crossproduct:
   Allow mapreduce to partition the left side of the input
   on each mapper
     let left_i = save all the left side key/value pairs that are
processed on that node
     in cleanup (or at the end of the reduce) : let right = open the
right side of the join on each node through hdfs
     for each pair of pairs in left_i x right:
         if transform (pair) !=null emit transform (pair)
         else continue
     endfor
   end on each
end proc

The important

On 11/5/09 1:15 PM, Ashutosh Chauhan wrote:
> Hi Edmund,
>
> If you can prepare your dataset in a way org.apache.hadoop.mapred.join
> requires, then it might be an efficient way to do joins in your case. IMHO
> though requirements placed by it though are pretty restrictive. Also,
> instead of reinventing the wheel, I would also suggest you to take a look
> how Pig tries to solve "joining large dataset" problem. It has infact four
> different join algorithms implemented and one or more them should satisfy
> your requirements. It seems to me merge-join of Pig is well suited in your
> case. Its only requirement is it wants dataset to be sorted on both sides.
> Datasets need not to be equipartitioned, need not to have same number of
> partitions etc. You said that sorting the dataset is pain in your case.
> Pig's orderby is quite sophisticated and performs sorting rather quite
> efficiently. If indeed doing sort is not an option, then you may want to
> consider hash join or skewed join of Pig.
>
> Joins in Pig are explained at high-level here:
> http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/
>
> Hope it helps,
> Ashutosh
>
> On Thu, Nov 5, 2009 at 06:19, Jason Venner<[EMAIL PROTECTED]>  wrote:
>
>    
>> Look at the join package in map reduce, it provides this functionality
>> quite
>> cleaning, for ordered datasets that have the same partitioning.
>> org.apache.hadoop.mapred.join in hadoop 19
>>
>> On Wed, Nov 4, 2009 at 6:52 AM, Edmund Kohlwey<[EMAIL PROTECTED]>  wrote:
>>
>>      
>>> Hi,
>>> I'm looking for an efficient way to do a cross join. I've gone through a
>>> few implementations, and I wanted to seek some advice before attempting
>>> another. The join is a "large collection to large collection" - so
>>>        
>> there's
>>      
>>> no trick optimizations like downloading one side of the join on each node
>>> (ie. map side join). The output of the join will be sparse, (its
>>>        
>> basically
>>      
>>> matching a large collection of regexes to a large collection of strings),
>>> but because of the nature of the data there's not really any way to
>>> pre-process either side of the join.
>>>
>>> 1. Naive approach - on a single node, iterate over both collections,
>>> resulting in reading the "left" file 1 times and the right file n times -
>>>        
>> I
>>      
>>> know this is bad.
>>> 2. Indexed approach - index data item with a row/col - requires
>>> replicating, sorting, and shuffling all the records 2 times - also not
>>>        
>> good.
>>      
>>> This actually seemed to perform worse than 1, and resulted in running out
>>>        
>> of
>>      
>>> disk space on the mappers when output was spilled to disk.
>>>
>>> I'm now considering what to try next. One idea is to improve on 1 by
>>> "blocking" the reads, so that the right side of the join is read b times,
>>> where b is the number of blocks the left side is split into.