I was looking into limitations of size/records for fragment replicated join
(map join) in pig. To test that I loaded a map (aka fragment) of longs in
an alias to join it with other alias which had few other columns. With a
map file of 50mb I saw GC Overheads on the mappers. I took a heap dump of
mapper to look into whats causing the GC Overheads and found that its the
memory footprint of fragment itself was high.
[image: Inline image 1]
Note, the hashmap was able to only load about 1.8 million records-
[image: Inline image 2]
Reason was that every map record has an overhead of about 1.5kb. Most of it
is part of retained heap, but it needs to be garbage collected.
[image: Inline image 3]
So, it turns out-
Size of heap required by a map join from above = 1.5 KB * Number of records
+ Size of input (uncompressed databytearray)... (assuming the key is a
So, to run your replicated join, you need to satisfy following criteria:
*1.5 KB * Number of records + Size of input (uncompressed) < estimated free
memory in the mapper (total heap - io.sort.mb - some minor constant etc.)*
Is that a right conclusion? Is there a setting/way to make this better?