You can always split your tables such that the same keys end up in the same splits. Then you replicated-join each pair of corresponding splits and take the union of the results.
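A minimal sketch of that idea in plain Python (outside Pig, just to show the mechanics): hash-partition both tables on the join key so matching keys land in the same split, do an in-memory replicated join per pair of splits, then union the per-split results. The table contents, key position, and split count are all made up for illustration.

```python
# Hash-partition both sides on the join key, replicated-join each pair
# of co-partitioned splits in memory, then union the results.

NUM_SPLITS = 4

def partition(rows, key_index, num_splits):
    """Hash-partition rows so equal keys always share a split."""
    splits = [[] for _ in range(num_splits)]
    for row in rows:
        splits[hash(row[key_index]) % num_splits].append(row)
    return splits

def replicated_join(big_split, small_split, key_index=0):
    """In-memory hash join of one pair of co-partitioned splits."""
    lookup = {}
    for row in small_split:                     # build side: the small split
        lookup.setdefault(row[key_index], []).append(row)
    return [big_row + small_row                 # probe side: the big split
            for big_row in big_split
            for small_row in lookup.get(big_row[key_index], [])]

big = [(1, 'a'), (2, 'b'), (3, 'c')]
small = [(1, 'x'), (3, 'y')]

joined = []
for b_split, s_split in zip(partition(big, 0, NUM_SPLITS),
                            partition(small, 0, NUM_SPLITS)):
    joined.extend(replicated_join(b_split, s_split))   # union of split joins
```

Each split of the small side only has to fit in memory on its own, which is what makes this work when the whole small table no longer does.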
On Jul 19, 2013, at 12:26 PM, Arun Ahuja <[EMAIL PROTECTED]> wrote:
> I have been using a replicated join to join a very large set of data with
> another one that is about 1000x smaller, and have generally seen large
> performance gains. However, the two do scale together, so that now, even
> though the RHS table is still 1000x smaller, it is too large to fit into
> memory. This will happen on only every 20th or so dataset the join is
> performed on, but I'd like to have something robust built to handle it.
> Is there any way to set up the replicated join to fall back to a regular
> join only on memory issues? Or any type of conditional I could set to
> check the dataset size first? I'm willing to even dig into the Pig code
> and implement this if anyone has ideas.
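One way to approximate the "check the dataset size first" conditional without touching Pig internals is to measure the small table before generating the script and only ask for `USING 'replicated'` when it is plausibly small enough for a mapper's heap. A hypothetical sketch, assuming a local file; on a real cluster you would get the size from HDFS (e.g. via the FileSystem API or `hadoop fs -du -s`) instead, and the threshold here is an invented placeholder, not a Pig constant:

```python
import os

# Assumed budget for the in-memory side of a fragment-replicate join.
REPLICATED_LIMIT_BYTES = 256 * 1024 * 1024

def choose_join_hint(small_table_path):
    """Return the Pig JOIN clause suffix to use, based on file size."""
    if os.path.getsize(small_table_path) <= REPLICATED_LIMIT_BYTES:
        return "USING 'replicated'"   # small enough: fragment-replicate join
    return ""                         # too big: plain reduce-side join
```

The returned fragment would then be spliced into the JOIN statement of a templated Pig script, so the oversized every-20th dataset silently takes the regular join path.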