On Jun 17, 2013, at 7:24 AM, Ido Hadanny wrote:
> We noticed that the current skewed join supports only 1 skewed table, and
> assumes that the second table isn't skewed.
> Please review this suggestion for a 2 skewed tables design:
> - Sample both tables
> - for each skewed key (with many records in at least one table), build a
> surrogate key in a GFCross style - e.g. if for this key there are 3M keys
> from the left table and 7M from the right table, and there are 100 reducers
> available, build GFCross with dimensions of sqrt(100*3/7) and sqrt(100*7/3)
> What do you say? Is this a necessary enhancement request? Or is it safe to
> assume that only one table will be skewed in each join?
When we built the original skewed join we chose to worry about it only in the case of 1 table being skewed for two reasons:
1) It made joins of skewed tables (even two skewed tables) possible. Before it was possible to have a join where neither table could fit all instances of a given key in memory (as the default hash join implementation requires) and thus the join could not be done. With this implementation you are guaranteed that you can split key instances for one of the inputs and thus complete the join. If it is skewed on both sides the join will still be slow, as you point out.
2) It addressed most of our use cases.
Obviously being able to handle cases where both sides are skewed more efficiently will be very valuable. If you're thinking of contributing in this area I encourage you to file a JIRA with your proposal.
> Thanks, Dudu and Ido
> Sent from my androido