|
|
-
Re: Cross JoinEdmund Kohlwey 2009-11-10, 14:56
Thanks to all who commented on this. I think there was some confusion
over what I was trying to do: indeed there was no common key between the two tables to join on, which made all the methods I investigated either inappropriate or inefficient. In the end I decided to write my own join class. It can be written in a reducer or a mapper. While the reducer implementation is a bit cleaner, the mapper implementation provides (theoretically) better distributed processing. For those who are interested, the basic algorithm is: x is defined as the cross product of two vectors proc crossproduct: Allow mapreduce to partition the left side of the input on each mapper let left_i = save all the left side key/value pairs that are processed on that node in cleanup (or at the end of the reduce) : let right = open the right side of the join on each node through hdfs for each pair of pairs in left_i x right: if transform (pair) !=null emit transform (pair) else continue endfor end on each end proc The important On 11/5/09 1:15 PM, Ashutosh Chauhan wrote: > Hi Edmund, > > If you can prepare your dataset in a way org.apache.hadoop.mapred.join > requires, then it might be an efficient way to do joins in your case. IMHO > though requirements placed by it though are pretty restrictive. Also, > instead of reinventing the wheel, I would also suggest you to take a look > how Pig tries to solve "joining large dataset" problem. It has infact four > different join algorithms implemented and one or more them should satisfy > your requirements. It seems to me merge-join of Pig is well suited in your > case. Its only requirement is it wants dataset to be sorted on both sides. > Datasets need not to be equipartitioned, need not to have same number of > partitions etc. You said that sorting the dataset is pain in your case. > Pig's orderby is quite sophisticated and performs sorting rather quite > efficiently. If indeed doing sort is not an option, then you may want to > consider hash join or skewed join of Pig. > > Joins in Pig are explained at high-level here: > http://squarecog.wordpress.com/2009/11/03/apache-pig-apittsburgh-hadoop-user-group/ > > Hope it helps, > Ashutosh > > On Thu, Nov 5, 2009 at 06:19, Jason Venner<[EMAIL PROTECTED]> wrote: > > >> Look at the join package in map reduce, it provides this functionality >> quite >> cleaning, for ordered datasets that have the same partitioning. >> org.apache.hadoop.mapred.join in hadoop 19 >> >> On Wed, Nov 4, 2009 at 6:52 AM, Edmund Kohlwey<[EMAIL PROTECTED]> wrote: >> >> >>> Hi, >>> I'm looking for an efficient way to do a cross join. I've gone through a >>> few implementations, and I wanted to seek some advice before attempting >>> another. The join is a "large collection to large collection" - so >>> >> there's >> >>> no trick optimizations like downloading one side of the join on each node >>> (ie. map side join). The output of the join will be sparse, (its >>> >> basically >> >>> matching a large collection of regexes to a large collection of strings), >>> but because of the nature of the data there's not really any way to >>> pre-process either side of the join. >>> >>> 1. Naive approach - on a single node, iterate over both collections, >>> resulting in reading the "left" file 1 times and the right file n times - >>> >> I >> >>> know this is bad. >>> 2. Indexed approach - index data item with a row/col - requires >>> replicating, sorting, and shuffling all the records 2 times - also not >>> >> good. >> >>> This actually seemed to perform worse than 1, and resulted in running out >>> >> of >> >>> disk space on the mappers when output was spilled to disk. >>> >>> I'm now considering what to try next. One idea is to improve on 1 by >>> "blocking" the reads, so that the right side of the join is read b times, >>> where b is the number of blocks the left side is split into. |