join BIG by key, SMALL by key using 'replicated';
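Spelled out for a join on four integer fields, that might look roughly like the sketch below. The paths, field names and the extra payload/label columns are made up for illustration; only the JOIN ... USING 'replicated' syntax comes from Pig itself. The relation that gets loaded into memory has to be listed last, so SMALL goes on the right.

```pig
-- Hypothetical input paths and schemas; adjust to the real data.
big   = LOAD 'input/big'   USING PigStorage('\t')
        AS (a:int, b:int, c:int, d:int, payload:chararray);
small = LOAD 'input/small' USING PigStorage('\t')
        AS (a:int, b:int, c:int, d:int, label:chararray);

-- Fragment-replicate join: the last relation (small) is held in memory
-- by each map task, so the join runs map-side without a reduce phase.
joined = JOIN big BY (a, b, c, d), small BY (a, b, c, d) USING 'replicated';

-- Hypothetical output path.
STORE joined INTO 'output/joined';
```

Because every map task holds its own copy of the small relation, the heap of each task has to be large enough for it, which is probably what the GC overhead error described below was about.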
On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:
> Hi. I've hit a problem with replicated join in Pig 0.11.
> I have two relations:
> BIG (3-6GB) and SMALL (100MB)
> I join them on four integer fields.
> It takes up to 30 minutes to join them.
> Join runs on 18 reducers: -Xmx=3072mb for Java, 128 GB in total
> 32 cores on each TaskTracker.
> So our hardware is really powerful.
> I ran part of the join locally and hit a terrible situation:
> 50% of the heap is
> Integers,
> arrays of these Integers,
> and ArrayLists holding the arrays of Integers.
> The GC overhead limit error happens. The same happened on the cluster. I raised Xms and
> Xmx on the cluster and the problem is gone.
> Anyway, joining 6GB/18 and 100MB for 30 minutes is far too much.
> I would like to reimplement replicated join.
> How can I do it?