Re: Replace join with custom implementation
joined = join BIG by key, SMALL by key using 'replicated';
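
Since the question below joins on four integer fields, the full statement would list all four keys, keeping the large relation first, because Pig loads every relation after the first into memory. A minimal sketch, not the poster's actual script; the alias and key names k1..k4 are assumed:

    -- a sketch: four-key fragment-replicated join
    -- BIG must come first; SMALL is loaded into memory on each map task
    joined = JOIN BIG BY (k1, k2, k3, k4),
             SMALL BY (k1, k2, k3, k4) USING 'replicated';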
On Fri, Aug 2, 2013 at 5:29 AM, Serega Sheypak <[EMAIL PROTECTED]> wrote:

> Hi. I've met a problem with replicated join in Pig 0.11.
> I have two relations:
> BIG (3-6GB) and SMALL (100MB)
> I join them on four integer fields.
> It takes up to 30 minutes to join them.
>
> The join runs on 18 reducers with -Xmx3072m for each Java task; 128 GB of RAM in total
> and 32 cores on each TaskTracker.
>
> So our hardware is really powerful.
>
> I ran a part of the join locally and hit a terrible situation:
> 50% of the heap is Integers,
> arrays of these Integers,
> and ArrayLists holding those arrays.
>
> The GC overhead limit gets exceeded. The same happened on the cluster. I raised Xms
> and Xmx on the cluster and the problem is gone.
>
> Anyway, taking 30 minutes to join 6 GB (split across 18 reducers) with 100 MB is far too much.
> I would like to reimplement replicated join.
> How can I do it?
>
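
For reference, before reimplementing the join by hand it may be worth trying Pig's other built-in join strategies, which need only a one-word change. A hedged sketch, reusing the assumed key names k1..k4 from above:

    -- skewed join: samples the left input and splits hot keys across
    -- reducers; useful when a few key combinations dominate
    joined_skewed = JOIN BIG BY (k1, k2, k3, k4),
                    SMALL BY (k1, k2, k3, k4) USING 'skewed';

    -- merge join: map-side with no in-memory table, but both inputs
    -- must already be sorted on the join keys (e.g. by a preceding ORDER)
    BIG_SORTED   = ORDER BIG BY k1, k2, k3, k4;
    SMALL_SORTED = ORDER SMALL BY k1, k2, k3, k4;
    joined_merge = JOIN BIG_SORTED BY (k1, k2, k3, k4),
                   SMALL_SORTED BY (k1, k2, k3, k4) USING 'merge';

Whether 'merge' pays off depends on whether the extra sort jobs cost less than the reduce-side join they replace.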