We noticed that the current skewed join supports only one skewed table and
assumes that the second table isn't skewed.
Please review this suggested design for joins where both tables are skewed:
- Sample both tables.
- For each skewed key (a key with many records in at least one table), build a
surrogate key in a GFCross style. E.g., if for this key there are 3M records
from the left table and 7M from the right table, and there are 100 reducers
available, build a GFCross with dimensions of sqrt(100*3/7) and sqrt(100*7/3),
i.e. roughly 6.5 by 15.3, whose product is the 100 reducers.
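
To make the grid idea concrete, here is a rough Python sketch of how the dimensions and the surrogate keys could be computed for one skewed key. This is only an illustration of the proposal, not the existing skewed join or GFCross code; the function names (grid_dims, left_targets, right_targets) are made up for the example, and it assumes both sampled counts are positive.

```python
import math
import random

def grid_dims(left_count, right_count, reducers):
    """Split `reducers` into a rows x cols grid for one skewed key.

    rows ~ sqrt(reducers * left / right), cols ~ sqrt(reducers * right / left),
    rounded to integers so that rows * cols never exceeds `reducers`.
    Assumes left_count > 0 and right_count > 0.
    """
    rows = round(math.sqrt(reducers * left_count / right_count))
    rows = max(1, min(rows, reducers))
    cols = max(1, reducers // rows)
    return rows, cols

def left_targets(rows, cols, rng=random):
    """A left record picks one random row and is sent to every cell in that row."""
    r = rng.randrange(rows)
    return [(r, c) for c in range(cols)]

def right_targets(rows, cols, rng=random):
    """A right record picks one random column and is sent to every cell in that column."""
    c = rng.randrange(cols)
    return [(r, c) for r in range(rows)]

# The example from above: 3M left records, 7M right records, 100 reducers.
rows, cols = grid_dims(3_000_000, 7_000_000, 100)
print(rows, cols)
```

Each (row, col) cell stands for one surrogate key value, i.e. one reducer. A left record's row and a right record's column always cross in exactly one cell, so every matching pair is joined exactly once, and each cell receives about left/rows and right/cols records.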
What do you think? Is this a necessary enhancement request, or is it safe to
assume that only one table will be skewed in each join?
Thanks, Dudu and Ido