Pig >> mail # dev

Replicated join: is there a setting to make this better?
Hi devs,

I was looking into the size/record limits of the fragment replicated join
(map-side join) in Pig. To test this, I loaded a map (aka the fragment) of
longs into an alias and joined it with another alias that had a few other
columns. With a map file of 50 MB I saw GC overhead errors on the mappers. I
took a heap dump of a mapper to see what was causing the GC overhead and
found that the memory footprint of the fragment itself was high.

[image: Inline image 1]

Note that the hashmap was only able to load about 1.8 million records:
[image: Inline image 2]
The reason is that every map record carries an overhead of about 1.5 KB. Most
of it is part of the retained heap, but it still needs to be garbage collected.
[image: Inline image 3]
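As a sanity check on those numbers, 1.8 million records at roughly 1.5 KB of overhead each already comes to about 2.8 GB of retained heap, before counting the data itself. A minimal sketch of that arithmetic (the class name and the 1.5 KB constant are assumptions taken from this thread, not Pig internals):

```java
// Back-of-envelope check of the heap-dump numbers above.
public class MapJoinOverhead {
    // ~1.5 KB of JVM overhead per HashMap entry (observed in the heap dump).
    static final long PER_RECORD_OVERHEAD_BYTES = 1536;

    /** Retained heap attributable to per-record overhead alone. */
    static long retainedBytes(long records) {
        return records * PER_RECORD_OVERHEAD_BYTES;
    }

    public static void main(String[] args) {
        long records = 1_800_000L; // entries loaded before GC overhead errors
        System.out.printf("~%.1f GB retained for %d records%n",
                retainedBytes(records) / 1e9, records);
    }
}
```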

So, it turns out:

Heap required by a map join (from above) = 1.5 KB * Number of records + Size
of input (uncompressed DataByteArray)... (assuming the key is a

So, to run your replicated join, you need to satisfy the following criterion:

*1.5 KB * Number of records + Size of input (uncompressed) < estimated free
memory in the mapper (total heap - io.sort.mb - some minor constant, etc.)*
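The criterion above can be sketched as a quick feasibility check. Everything here is an illustrative assumption from this thread (the 1.5 KB constant, the example heap and io.sort.mb values, the slack for "minor constants"), not a Pig API:

```java
// Sketch of the sizing rule above for a fragment replicated join.
public class ReplicatedJoinFit {
    // ~1.5 KB per record, per the heap dump discussed in this thread.
    static final long PER_RECORD_OVERHEAD = 1536;

    /** Free heap left for the fragment: total heap minus io.sort.mb minus slack. */
    static long freeHeap(long totalHeapBytes, long ioSortBytes, long slackBytes) {
        return totalHeapBytes - ioSortBytes - slackBytes;
    }

    /** The criterion: 1.5 KB * records + uncompressed input < estimated free heap. */
    static boolean fits(long records, long uncompressedInputBytes,
                        long totalHeapBytes, long ioSortBytes, long slackBytes) {
        long needed = records * PER_RECORD_OVERHEAD + uncompressedInputBytes;
        return needed < freeHeap(totalHeapBytes, ioSortBytes, slackBytes);
    }

    public static void main(String[] args) {
        // Example: 1 GB mapper heap, io.sort.mb = 100 MB, 50 MB slack.
        // A 50 MB fragment with 1.8M records needs ~2.8 GB and does not fit.
        System.out.println(fits(1_800_000L, 50L << 20,
                1L << 30, 100L << 20, 50L << 20));
    }
}
```

By this estimate, the per-record overhead dominates the fragment's on-disk size, which matches the observation that a 50 MB map file blew past the mapper heap.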

Is that the right conclusion? Is there a setting or a way to make this better?



Johnny Zhang 2013-02-22, 02:30
Aniket Mokashi 2013-02-22, 03:02
Prashant Kommireddi 2013-02-22, 03:07
Aniket Mokashi 2013-02-22, 08:42
Jonathan Coveney 2013-02-22, 09:17