Pig >> mail # dev >> Replicated join: is there a setting to make this better?


Re: Replicated join: is there a setting to make this better?
Interesting, I found this in the 0.11 documentation:

Fragment replicate joins are experimental; we don't have a strong sense of
how small the small relation must be to fit into memory. In our tests with
a simple query that involves just a JOIN, a relation of up to 100 M can be
used if the process overall gets 1 GB of memory. Please share your
observations and experience with us.

Let me open a JIRA to share some of my experience with this, or do we
already have one?
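The fit criterion from my original email further down the thread can be sketched as a quick back-of-the-envelope check. This is only a sketch: the ~1.5 KB per-record overhead and the free-memory estimate (total heap minus io.sort.mb minus a small constant) come from my heap-dump observations, not from any fixed Pig constants.

```python
# Rough check: will the fragment (small relation) of a replicated join
# fit in a mapper's heap?
# Assumptions (from the measurements in this thread, not Pig constants):
#   - ~1.5 KB of JVM overhead per map record (long-key case)
#   - free memory ~= total heap - io.sort.mb - a small fixed constant

MB = 1024 * 1024
KB = 1024

def replicated_join_fits(num_records, input_bytes, total_heap_bytes,
                         io_sort_mb=100, misc_overhead_bytes=50 * MB,
                         per_record_overhead=int(1.5 * KB)):
    """Return True if the estimated fragment footprint fits in free heap."""
    required = per_record_overhead * num_records + input_bytes
    free = total_heap_bytes - io_sort_mb * MB - misc_overhead_bytes
    return required < free

# Example: 500k long keys, 50 MB uncompressed input, 1 GB mapper heap.
print(replicated_join_fits(500_000, 50 * MB, 1024 * MB))    # True
# 1.8 million records blow past a 1 GB heap on per-record overhead alone.
print(replicated_join_fits(1_800_000, 50 * MB, 1024 * MB))  # False
```

Note how the per-record overhead, not the raw input size, dominates the estimate once the record count grows.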

~Aniket
On Thu, Feb 21, 2013 at 7:07 PM, Prashant Kommireddi <[EMAIL PROTECTED]>wrote:

> Mailing lists don't support attachments. Is JIRA a place we can discuss
> this? Based on the outcome we could classify it as an improvement, a bug,
> or "Not a Problem"?
>
> -Prashant
>
> On Thu, Feb 21, 2013 at 7:02 PM, Aniket Mokashi <[EMAIL PROTECTED]>
> wrote:
>
> > Thanks Johnny. I am not sure how to post these images on mailing lists! :(
> >
> >
> > On Thu, Feb 21, 2013 at 6:30 PM, Johnny Zhang <[EMAIL PROTECTED]>
> > wrote:
> >
> > > Hi, Aniket:
> > > Your image is blank :) Not sure if this only happens to me, though.
> > >
> > > Johnny
> > >
> > >
> > > On Thu, Feb 21, 2013 at 6:08 PM, Aniket Mokashi <[EMAIL PROTECTED]>
> > > wrote:
> > >
> > > > I think the email was filtered out. Resending.
> > > >
> > > >
> > > > ---------- Forwarded message ----------
> > > > From: Aniket Mokashi <[EMAIL PROTECTED]>
> > > > Date: Wed, Feb 20, 2013 at 1:18 PM
> > > > Subject: Replicated join: is there a setting to make this better?
> > > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > > >
> > > >
> > > > Hi devs,
> > > >
> > > > I was looking into the size/record limitations of fragment replicated
> > > > join (map join) in Pig. To test this, I loaded a map (aka fragment) of
> > > > longs into an alias and joined it with another alias that had a few
> > > > other columns. With a map file of 50 MB I saw GC overhead errors in
> > > > the mappers. I took a heap dump of a mapper to look into what was
> > > > causing the GC overhead and found that the memory footprint of the
> > > > fragment itself was high.
> > > >
> > > > [image: Inline image 1]
> > > >
> > > > Note that the hashmap was only able to load about 1.8 million records:
> > > > [image: Inline image 2]
> > > > The reason is that every map record carries an overhead of about
> > > > 1.5 KB. Most of it is part of the retained heap, but it still needs
> > > > to be garbage collected.
> > > > [image: Inline image 3]
> > > >
> > > > So, it turns out:
> > > >
> > > > Heap required by a map join (from the above) = 1.5 KB * number of
> > > > records + size of input (uncompressed DataByteArray), assuming the
> > > > key is a long.
> > > >
> > > > So, to run your replicated join, you need to satisfy the following
> > > > criterion:
> > > >
> > > > *1.5 KB * number of records + size of input (uncompressed) <
> > > > estimated free memory in the mapper (total heap - io.sort.mb - some
> > > > minor constant, etc.)*
> > > >
> > > > Is that the right conclusion? Is there a setting or another way to
> > > > make this better?
> > > >
> > > > Thanks,
> > > >
> > > > Aniket
> > > >
> > > >
> > > >
> > > >
> > > > --
> > > > "...:::Aniket:::... Quetzalco@tl"
> > > >
> > >
> >
> >
> >
> > --
> > "...:::Aniket:::... Quetzalco@tl"
> >
>

--
"...:::Aniket:::... Quetzalco@tl"