[Pig dev] Replicated join: is there a setting to make this better?


Re: Replicated join: is there a setting to make this better?
Mailing lists don't support attachments. Is JIRA a place we can discuss
this? Based on the outcome we could classify it either as an
improvement/bug or as "Not a Problem"?

-Prashant

On Thu, Feb 21, 2013 at 7:02 PM, Aniket Mokashi <[EMAIL PROTECTED]> wrote:

> Thanks, Johnny. I am not sure how to post these images on mailing lists! :(
>
>
> On Thu, Feb 21, 2013 at 6:30 PM, Johnny Zhang <[EMAIL PROTECTED]>
> wrote:
>
> > Hi, Aniket:
> > Your image is blank :) Not sure if this only happens to me, though.
> >
> > Johnny
> >
> >
> > On Thu, Feb 21, 2013 at 6:08 PM, Aniket Mokashi <[EMAIL PROTECTED]>
> > wrote:
> >
> > > I think the email was filtered out. Resending.
> > >
> > >
> > > ---------- Forwarded message ----------
> > > From: Aniket Mokashi <[EMAIL PROTECTED]>
> > > Date: Wed, Feb 20, 2013 at 1:18 PM
> > > Subject: Replicated join: is there a setting to make this better?
> > > To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
> > >
> > >
> > > Hi devs,
> > >
> > > I was looking into the size/record limitations of fragment replicated
> > > join (map join) in Pig. To test this, I loaded a map (aka fragment) of
> > > longs into an alias and joined it with another alias that had a few
> > > other columns. With a map file of 50 MB I saw GC overhead errors on
> > > the mappers. I took a heap dump of a mapper to look into what was
> > > causing the GC overhead, and found that the memory footprint of the
> > > fragment itself was high.
> > >
> > > [image: Inline image 1]
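
For reference, a minimal Pig Latin sketch of the kind of script being
described (file names, aliases, and schemas here are hypothetical, not
taken from the original test):

    -- the small relation ("fragment") of longs; in a replicated join,
    -- every relation after the first is loaded into an in-memory
    -- hashmap on each mapper
    small = LOAD 'fragment_of_longs' AS (key:long);
    -- the larger relation with a few other columns
    big = LOAD 'big_input' AS (key:long, c1:chararray, c2:chararray);
    -- fragment replicated (map-side) join
    joined = JOIN big BY key, small BY key USING 'replicated';
    STORE joined INTO 'joined_out';
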
> > >
> > > Note that the hashmap was only able to load about 1.8 million records:
> > > [image: Inline image 2]
> > > The reason is that every map record has an overhead of about 1.5 KB.
> > > Most of it is part of the retained heap, but it still needs to be
> > > garbage collected.
> > > [image: Inline image 3]
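
Working through the numbers quoted above (taking the ~1.5 KB
per-record figure from the heap dump as given):

    1.8 million records * ~1.5 KB/record ≈ 2.7 GB of heap,
    versus a serialized fragment of only 50 MB on disk.

In other words, the in-memory hashmap is roughly 50x larger than the
input file, which is why a 50 MB fragment can exhaust a mapper's heap.
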
> > >
> > > So, it turns out:
> > >
> > > Heap required by a map join from the above = 1.5 KB * number of
> > > records + size of input (uncompressed DataByteArray), assuming the
> > > key is a long.
> > >
> > > So, to run your replicated join, you need to satisfy the following
> > > criterion:
> > >
> > > *1.5 KB * number of records + size of input (uncompressed) <
> > > estimated free memory in the mapper (total heap - io.sort.mb - some
> > > minor constants etc.)*
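
To make the criterion concrete, solving it for the record count under
some assumed (hypothetical) mapper settings: a 1 GB heap, io.sort.mb =
100, and the 50 MB fragment from the test above:

    free memory ≈ 1024 MB - 100 MB - minor constants ≈ 900 MB
    max records < (900 MB - 50 MB) / 1.5 KB ≈ 580,000

Under those assumptions the fragment is capped at a few hundred
thousand records, well short of the 1.8 million loaded in the test.
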
> > >
> > > Is that the right conclusion? Is there a setting/way to make this better?
> > >
> > > Thanks,
> > >
> > > Aniket
> > >
> > > --
> > > "...:::Aniket:::... Quetzalco@tl"
> > >
> >
>
>
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>