Pig user mailing list — Replicated Join and OOM errors


Re: Replicated Join and OOM errors
Pig does not currently have a way to do this. The development of a feature
like this is tracked at https://issues.apache.org/jira/browse/PIG-2784.
Feel free to add a subtask and take a stab at it.

~Aniket
On Fri, Jul 19, 2013 at 12:58 PM, Mehmet Tepedelenlioglu <
[EMAIL PROTECTED]> wrote:

> You can always split your tables such that the same keys end up in the same
> splits. Then you replicated-join the corresponding splits and take the
> union.
>
> On Jul 19, 2013, at 12:26 PM, Arun Ahuja <[EMAIL PROTECTED]> wrote:
>
> > I have been using a replicated join to join a very large set of data with
> > another one that is about 1000x smaller, and have generally seen large
> > performance gains.
> >
> > However, they do scale together, so that now, even though the RHS table
> > is still 1000x smaller, it is too large to fit into memory. This will
> > happen on only every 20th or so dataset that the join is performed on,
> > but I'd like to have something robust built to handle this.
> >
> > Is there any way to set up the replicated join to fall back to a regular
> > join only on memory issues? Or any type of conditional I could set to
> > check the dataset size first? Willing to even dig into the Pig code and
> > implement this if anyone has ideas.
> >
> > Thanks
> >
> > Arun
>
>
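Mehmet's split-and-union workaround could be sketched in Pig Latin roughly as below. All relation aliases, the integer join key `k`, and the bucket count of 2 are assumptions for illustration; in practice you would hash a non-integer key and pick enough buckets that each slice of the small table fits in heap:

```pig
-- Partition both relations into buckets by the join key so that matching
-- keys always land in the same bucket (assumed: k is an int field).
SPLIT big INTO big0 IF (k % 2 == 0), big1 IF (k % 2 == 1);
SPLIT small INTO small0 IF (k % 2 == 0), small1 IF (k % 2 == 1);

-- Each small bucket is now roughly half the size, so it may fit in memory
-- for a fragment-replicate join where the whole relation did not.
j0 = JOIN big0 BY k, small0 BY k USING 'replicated';
j1 = JOIN big1 BY k, small1 BY k USING 'replicated';

-- Recombine the per-bucket join results.
result = UNION j0, j1;
```

The trade-off is running one replicated-join job per bucket instead of a single job, but each one stays within the memory budget.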
--
"...:::Aniket:::... Quetzalco@tl"
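Since Pig Latin itself has no conditional to switch join strategies, the "check the dataset size first" idea from the thread would have to live in a wrapper script around the Pig invocation. A minimal sketch, with the threshold value and all names being assumptions (in practice the byte count would come from something like `hadoop fs -du -s /path/to/small`):

```shell
#!/bin/sh
# Hedged sketch: decide before launching Pig whether the small relation is
# likely to fit in memory for a fragment-replicate join.
choose_join_type() {
  # $1: on-disk size of the small relation, in bytes
  small_bytes=$1
  threshold=$((512 * 1024 * 1024))  # assumed heap budget for the replicated join
  if [ "$small_bytes" -lt "$threshold" ]; then
    echo "replicated"
  else
    echo "regular"
  fi
}

choose_join_type $((100 * 1024 * 1024))       # well under the threshold
choose_join_type $((2 * 1024 * 1024 * 1024))  # too big for memory
```

The chosen strategy could then be passed into the script as a parameter, e.g. `pig -p JOIN_CLAUSE="USING 'replicated'" join.pig` versus an empty `JOIN_CLAUSE` for a regular join (the parameter name is illustrative).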