
Pig >> mail # user >> pig reduce OOM

pig reduce OOM
I wrote a Pig script in which one of the reducers always hits an OOM, no matter how I change the parallelism.
        Here's the script snippet:
Data = group SourceData all;
Result = foreach Data generate group, COUNT(SourceData);
store Result into 'XX';

  I analyzed the dumped Java heap and found that the reducer loads all the data into memory for the foreach/COUNT.

Can I re-implement BinSedesTuple so that the reducer does not have to load all the data for the computation?
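For what it's worth, a common workaround (a sketch only, untested here) is to split the count into two stages so that no single reducer ever receives the whole bag: first count per key across many reducers, then sum the partial counts. `some_key` below is a hypothetical field of SourceData; any reasonably well-distributed field would do:

```
-- Sketch (untested): two-stage count to avoid one reducer holding all tuples.
-- 'some_key' is a placeholder field; pick any well-distributed column.
Partial = foreach (group SourceData by some_key parallel 20)
          generate COUNT(SourceData) as cnt;
Result  = foreach (group Partial all) generate SUM(Partial.cnt);
store Result into 'XX';
```

The final `group ... all` still goes to a single reducer, but it only sees one small count per key rather than every input tuple.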

Here's the object dominator tree:

Here's the jmap result:

Haitao Yao
weibo: @haitao_yao
Skype:  haitao.yao.final