Re: Question on Join algorithms in PIG
Johnny Zhang 2013-04-18, 05:33
Besides the three implementations you mentioned, there is also the default join,
which needs both a map and a reduce phase.
Replicated join doesn't require the data sets to be sorted. You just need to
make sure the second input is the smaller one, since it is loaded into memory.
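
To illustrate why sorting is not needed, here is a minimal sketch of the idea
behind a replicated (map-side hash) join. It is plain Python, not Pig's Java
code; the function name, the sample data, and the key functions are all made up
for illustration.

```python
from collections import defaultdict

def replicated_join(large, small, large_key, small_key):
    """Map-side hash join sketch: `small` is held in memory, `large` is streamed."""
    # Build an in-memory hash table from the smaller input.
    table = defaultdict(list)
    for row in small:
        table[small_key(row)].append(row)
    # Stream the larger input in whatever order it arrives; no sorting needed.
    for row in large:
        for match in table.get(large_key(row), ()):
            yield row + match

# Hypothetical sample data: the small input comes second, as in the advice above.
clicks = [(2, "/home"), (1, "/cart"), (2, "/faq"), (3, "/x")]  # large, unsorted
users = [(1, "ann"), (2, "bob")]                               # small

result = list(replicated_join(clicks, users, lambda r: r[0], lambda r: r[0]))
```

Each record of the large input is matched with a single hash lookup, which is
why this join can finish in the map phase as long as the small input fits in
memory.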
These two classes are related to merge join:
/** Merge Join indexer is used to generate on the fly index for doing Merge
* It samples first record from every block of right side input.
* and returns tuple in the following format :
* (key0, key1,...,position,splitIndex)
* These tuples are then sorted before being written out to index file on
2. The physical operator class is POMergeJoin.java
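
As a rough illustration of the index described in that comment, the sketch
below samples the first record of each block and emits sorted
(key0, ..., position, splitIndex) tuples. It is plain Python, not the Pig
source. Treating "position" as a record offset, modelling blocks as in-memory
lists, and the lookup helper `first_split_for` are all assumptions made for the
example.

```python
import bisect

def build_index(splits, key):
    """Sample the first record of every split; return sorted index tuples."""
    index = []
    position = 0  # assumption: record offset of the split's first row
    for split_index, split in enumerate(splits):
        if split:  # empty splits contribute no index entry
            index.append(key(split[0]) + (position, split_index))
        position += len(split)
    index.sort()
    return index

def first_split_for(index, probe_key):
    """Hypothetical lookup: split where a scan for probe_key should start."""
    keys = [entry[:-2] for entry in index]
    # Step back one entry: rows equal to probe_key may begin in the split
    # before the first one whose sampled key is >= probe_key.
    i = bisect.bisect_left(keys, probe_key) - 1
    return index[max(i, 0)][-1]

# Hypothetical right-side input, already sorted on the join key (column 0).
right_splits = [[(1, "a"), (3, "b")], [(3, "c"), (5, "d")], [(7, "e")]]
index = build_index(right_splits, key=lambda row: (row[0],))
```

With an index like this, the left side can jump to the split that may contain
a given key instead of scanning the right side from the beginning.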
On Wed, Apr 17, 2013 at 9:59 PM, Yunming Zhang
> I am working on a research project optimizing Join algorithms implemented
> in MapReduce.
> My belief is that Pig currently has three types of Join implemented, the
> Replicated Join, Skewed Join and Merge Join. From my understanding reading
> the documentation, it seems that both Replicated and Merge Join are map
> side Joins and Skewed Join is a reduce side join?
> Overall, I have a few questions,
> 1. Does replicated Join require the data sets to be sorted? (I know merge
> join requires sorted datasets)
> 2. Can anyone point me to the actual implementation of the Map Reduce
> program that is generated by Pig with these three different kinds of joins?
> Or the code that maps Pig to Hadoop Map Reduce Join algorithm?
> I found POMergeJoin and POSkewedJoin, but I still couldn't figure out
> what the actual MapReduce implementation would look like.