Pig >> mail # dev >> Replicated join: is there a setting to make this better?


Re: Replicated join: is there a setting to make this better?
Hi, Aniket:
your image is blank :) not sure if this only happens to me though.

Johnny
On Thu, Feb 21, 2013 at 6:08 PM, Aniket Mokashi <[EMAIL PROTECTED]> wrote:

> I think the email was filtered out. Resending.
>
>
> ---------- Forwarded message ----------
> From: Aniket Mokashi <[EMAIL PROTECTED]>
> Date: Wed, Feb 20, 2013 at 1:18 PM
> Subject: Replicated join: is there a setting to make this better?
> To: "[EMAIL PROTECTED]" <[EMAIL PROTECTED]>
>
>
> Hi devs,
>
> I was looking into the size/record limits of fragment replicated
> join (map join) in Pig. To test this, I loaded a map (aka fragment) of longs
> into an alias and joined it with another alias that had a few other columns.
> With a map file of 50 MB I saw GC overhead issues on the mappers. I took a
> heap dump of a mapper to see what was causing them and found that the
> memory footprint of the fragment itself was high.
>
> [image: Inline image 1]
>
> Note that the hashmap was only able to load about 1.8 million records:
> [image: Inline image 2]
> The reason was that every map record has an overhead of about 1.5 KB. Most of
> it is part of the retained heap, but it needs to be garbage collected.
> [image: Inline image 3]
>
> So, it turns out:
>
> Heap size required by the map join above = 1.5 KB * Number of
> records + Size of input (uncompressed DataByteArray)... (assuming the key
> is a long).
>
> So, to run your replicated join, you need to satisfy the following criterion:
>
> *1.5 KB * Number of records + Size of input (uncompressed) < estimated
> free memory in the mapper (total heap - io.sort.mb - some minor constant
> etc.)*
>
> Is that the right conclusion? Is there a setting or way to make this better?
>
> Thanks,
>
> Aniket
>
> --
> "...:::Aniket:::... Quetzalco@tl"
>
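
The rule of thumb in the quoted message can be turned into a quick back-of-the-envelope check. The sketch below is not part of Pig. The function names, the 1024-byte KB, and the example heap and io.sort.mb values are all assumptions for illustration. Only the 1.5 KB per-record overhead, the 50 MB input, and the 1.8 million record count come from the message.

```python
# Rough check of the heap estimate from the quoted message:
#   1.5 KB * number_of_records + uncompressed_input_size < free mapper heap
# All names below are hypothetical helpers, not Pig or Hadoop APIs.

KB = 1024          # assumption: binary kilobytes
MB = 1024 * KB


def required_heap_bytes(num_records, input_bytes, per_record_overhead=1.5 * KB):
    """Estimated heap needed to hold the replicated (fragment) side."""
    return per_record_overhead * num_records + input_bytes


def free_mapper_heap_bytes(total_heap_bytes, io_sort_bytes, other_bytes=0):
    """Heap left for the fragment: total heap - io.sort.mb - other overhead."""
    return total_heap_bytes - io_sort_bytes - other_bytes


def fits_in_mapper(num_records, input_bytes, total_heap_bytes,
                   io_sort_bytes, other_bytes=0):
    """True if the estimate says the fragment fits in the mapper's free heap."""
    needed = required_heap_bytes(num_records, input_bytes)
    available = free_mapper_heap_bytes(total_heap_bytes, io_sort_bytes, other_bytes)
    return needed < available


if __name__ == "__main__":
    # Figures from the message: 50 MB input, about 1.8 million records.
    # The 1 GB heap and 100 MB io.sort.mb are made-up example settings.
    needed = required_heap_bytes(1_800_000, 50 * MB)
    print(f"estimated heap needed: {needed / MB:.0f} MB")
    print("fits:", fits_in_mapper(1_800_000, 50 * MB, 1024 * MB, 100 * MB))
```

Whether the estimate holds for other key types or tuple shapes is exactly the open question in the thread, so treat the result as a rough guide only.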