

Re: Replace join with custom implementation
join BIG by key, SMALL by key using 'replicated';
On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:

> Hi. I've run into a problem with replicated join in Pig 0.11.
> I have two relations:
> BIG (3-6GB) and SMALL (100MB)
> I join them on four integer fields.
> It takes up to 30 minutes to join them.
>
> Join runs on 18 reducers: -Xmx=3072mb for Java, 128 GB in total
> 32 cores on each TaskTracker.
>
> So our hardware is really powerful.
>
> I ran part of the join locally and saw a terrible situation:
> 50% of the heap is
> Integers,
> arrays of these Integers,
> and ArrayLists holding those arrays.
>
> "GC overhead limit exceeded" errors occur. The same happened on the cluster. I raised Xms
> and Xmx on the cluster and the problem went away.
>
> Anyway, joining 6 GB/18 and 100 MB for 30 minutes is far too long.
> I would like to reimplement replicated join.
> How can I do it?
>
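
On the "how can I do it" part: below is a minimal sketch of the general idea behind an in-memory hash join, written with the boxed-Integer heap problem described above in mind. It is not Pig's implementation and does not use any Pig API. The names `ReplicatedJoinSketch`, `Key4`, `buildIndex` and `probe` are hypothetical, and it assumes every row is an `int[]` whose first four columns are the join key. SMALL is indexed once in a HashMap, and BIG is streamed past it.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch, not Pig's code: hash join on four primitive int fields.
final class ReplicatedJoinSketch {

    // Composite key that stores the four join fields as primitives,
    // so no boxed Integer objects are created per field.
    static final class Key4 {
        final int a, b, c, d;

        Key4(int a, int b, int c, int d) {
            this.a = a; this.b = b; this.c = c; this.d = d;
        }

        @Override
        public boolean equals(Object o) {
            if (!(o instanceof Key4)) return false;
            Key4 k = (Key4) o;
            return a == k.a && b == k.b && c == k.c && d == k.d;
        }

        @Override
        public int hashCode() {
            int h = a;
            h = 31 * h + b;
            h = 31 * h + c;
            h = 31 * h + d;
            return h;
        }
    }

    // Assumption: columns 0..3 of every row hold the join key.
    private static Key4 keyOf(int[] row) {
        return new Key4(row[0], row[1], row[2], row[3]);
    }

    // Build side: index the SMALL relation by its four-field key.
    static Map<Key4, List<int[]>> buildIndex(List<int[]> small) {
        Map<Key4, List<int[]>> index = new HashMap<>();
        for (int[] row : small) {
            index.computeIfAbsent(keyOf(row), k -> new ArrayList<>()).add(row);
        }
        return index;
    }

    // Probe side: stream the BIG relation and emit BIG row + matching SMALL row.
    // A real job would write each joined row out instead of collecting a list.
    static List<int[]> probe(Map<Key4, List<int[]>> index, Iterable<int[]> big) {
        List<int[]> out = new ArrayList<>();
        for (int[] row : big) {
            List<int[]> matches = index.get(keyOf(row));
            if (matches == null) continue; // inner join: drop unmatched rows
            for (int[] m : matches) {
                int[] joined = Arrays.copyOf(row, row.length + m.length);
                System.arraycopy(m, 0, joined, row.length, m.length);
                out.add(joined);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        List<int[]> small = Arrays.asList(new int[]{1, 2, 3, 4, 100});
        List<int[]> big = Arrays.asList(new int[]{1, 2, 3, 4, 7}, new int[]{5, 5, 5, 5, 8});
        for (int[] joined : probe(buildIndex(small), big)) {
            System.out.println(Arrays.toString(joined));
        }
    }
}
```

Keeping rows as `int[]` and the key as four primitive fields is the design choice that matters here: it avoids one object per integer value. Whether this beats the built-in join on your data is something you would have to measure.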