Pig >> mail # dev >> Replicated join: is there a setting to make this better?


Aniket Mokashi 2013-02-20, 21:18
Johnny Zhang 2013-02-22, 02:30
Aniket Mokashi 2013-02-22, 03:02
Prashant Kommireddi 2013-02-22, 03:07
Aniket Mokashi 2013-02-22, 08:42

Re: Replicated join: is there a setting to make this better?
One quick way to vastly improve the memory efficiency is to utilize the
SchemaTuple addition.

https://issues.apache.org/jira/browse/PIG-2359

This should cut memory use in half, at least.

2013/2/22 Aniket Mokashi <[EMAIL PROTECTED]>

> Interesting, I found this in 0.11 documentation:
>
> Fragment replicate joins are experimental; we don't have a strong sense of
> how small the small relation must be to fit into memory. In our tests with
> a simple query that involves just a JOIN, a relation of up to 100 M can be
> used if the process overall gets 1 GB of memory. Please share your
> observations and experience with us.
>
> Let me open a JIRA to share some of my experience with this, or do we
> already have one?
>
> ~Aniket
>
>
> On Thu, Feb 21, 2013 at 7:07 PM, Prashant Kommireddi <[EMAIL PROTECTED]>
> wrote:
>
> > Mailing lists don't support attachments. Is JIRA a place where we can
> > discuss this? Based on the outcome, we could classify it as either an
> > improvement/bug or "Not a Problem".
> >
> > -Prashant
> >
> > On Thu, Feb 21, 2013 at 7:02 PM, Aniket Mokashi <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Thanks Johnny. I am not sure how to post these images on mailing
> > > lists! :(
> > >
> > >
> > > On Thu, Feb 21, 2013 at 6:30 PM, Johnny Zhang <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > Hi, Aniket:
> > > > your image is blank :) not sure if this only happens to me though.
> > > >
> > > > Johnny
> > > >
> > > >
> > > > On Thu, Feb 21, 2013 at 6:08 PM, Aniket Mokashi <[EMAIL PROTECTED]>
> > > > wrote:
> > > >
> > > > > I think the email was filtered out. Resending.
> > > > >
> > > > >
> > > > > ---------- Forwarded message ----------
> > > > > From: Aniket Mokashi <[EMAIL PROTECTED]>
> > > > > Date: Wed, Feb 20, 2013 at 1:18 PM
> > > > > Subject: Replicated join: is there a setting to make this better?
> > > > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > > > >
> > > > >
> > > > > Hi devs,
> > > > >
> > > > > > I was looking into the size/record limits for fragment replicated
> > > > > > join (map join) in Pig. To test this, I loaded a map (aka fragment)
> > > > > > of longs into an alias and joined it with another alias that had a
> > > > > > few other columns. With a map file of 50 MB I saw GC overhead on the
> > > > > > mappers. I took a heap dump of a mapper to see what was causing the
> > > > > > GC overhead and found that the memory footprint of the fragment
> > > > > > itself was high.
> > > > >
> > > > > [image: Inline image 1]
> > > > >
> > > > > > Note that the hashmap was only able to load about 1.8 million
> > > > > > records:
> > > > > > [image: Inline image 2]
> > > > > > The reason is that every map record has an overhead of about 1.5 KB.
> > > > > > Most of it is part of the retained heap, but it needs to be garbage
> > > > > > collected.
> > > > > > [image: Inline image 3]
> > > > >
> > > > > > So, it turns out:
> > > > > >
> > > > > > Size of heap required by a map join from above = 1.5 KB * Number of
> > > > > > records + Size of input (uncompressed DataByteArray), assuming the
> > > > > > key is a long.
> > > > > >
> > > > > > So, to run a replicated join, you need to satisfy the following
> > > > > > criterion:
> > > > > >
> > > > > > *1.5 KB * Number of records + Size of input (uncompressed) <
> > > > > > estimated free memory in the mapper (total heap - io.sort.mb - some
> > > > > > minor constant, etc.)*
> > > > > >
> > > > > > Is that the right conclusion? Is there a setting/way to make this
> > > > > > better?
> > > > >
> > > > > Thanks,
> > > > >
> > > > > Aniket
> > > > >
> > > > > *
> > > > > *
> > > > >
> > > > >
> > > > >
> > > > > --
> > > > > "...:::Aniket:::... Quetzalco@tl"
> > > > >
> > > >
> > >
> > >
> > >
> > > --
> > > "...:::Aniket:::... Quetzalco@tl"
> > >
> >
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>
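
The rule of thumb from the original message quoted above (about 1.5 KB of overhead per record, plus the uncompressed input size, compared against the free mapper heap) can be tried out with a few lines of Python. This is only a rough sketch: the function names, the reserve_bytes parameter, and the sample numbers (1 GB heap, 100 MB io.sort.mb) are illustrative assumptions, not Pig settings or measurements from the thread.

```python
# Rough heap estimate for a fragment-replicate join, following the rule of
# thumb in the quoted message. All names and sample values are hypothetical.

# ~1.5 KB of overhead per record, the figure reported from the heap dump.
PER_RECORD_OVERHEAD_BYTES = 1.5 * 1024

MB = 1024 * 1024


def estimated_heap_bytes(num_records, input_bytes):
    """Per-record overhead plus the uncompressed size of the small relation."""
    return PER_RECORD_OVERHEAD_BYTES * num_records + input_bytes


def fits_in_mapper(num_records, input_bytes, total_heap_bytes,
                   io_sort_bytes, reserve_bytes=0):
    """True if the estimate is below total heap - io.sort.mb - reserve."""
    free_bytes = total_heap_bytes - io_sort_bytes - reserve_bytes
    return estimated_heap_bytes(num_records, input_bytes) < free_bytes


if __name__ == "__main__":
    # Assumed mapper: 1 GB heap, 100 MB io.sort.mb, 50 MB small relation.
    for records in (500_000, 1_000_000):
        needed_mb = estimated_heap_bytes(records, 50 * MB) / MB
        ok = fits_in_mapper(records, 50 * MB, 1024 * MB, 100 * MB)
        print(f"{records} records: ~{needed_mb:.0f} MB needed, fits={ok}")
```

Under these assumed numbers the per-record overhead dominates: the 50 MB of raw input is small next to the estimated hashmap overhead, which is the reason the thread looks at reducing per-tuple memory (for example via SchemaTuple, PIG-2359).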