

Re: pig reduce OOM
BinSedesTuple is just the tuple, changing it won't do anything about the
fact that lots of tuples are being loaded.

The snippet you provided will not load all the data for computation, since
COUNT implements the Algebraic interface (partial counts are computed on the
map side and in the combiner, so only the partials reach the reducer).
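
One common way the combiner gets disabled is projecting the grouped bag
itself in the same foreach. A hedged illustration (the relation names here
are hypothetical):

    -- Combiner-friendly: COUNT is Algebraic, so partial counts run map-side
    Data = GROUP SourceData ALL;
    Good = FOREACH Data GENERATE group, COUNT(SourceData);

    -- Combiner-disabled: the bag itself is also projected, so every tuple
    -- must be shipped to and materialized in the reducer
    Bad  = FOREACH Data GENERATE group, COUNT(SourceData), SourceData;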

Something else is causing tuples to be materialized. Are you using other
UDFs? Can you provide more details on the script? When you run "explain" on
"Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc.?
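
For example, from the grunt shell (using the relation name from your script):

    grunt> explain Result;

In the combine plan you should see the staged versions of the UDF
(COUNT$Initial, COUNT$Intermediate) and COUNT$Final in the reduce plan; if
only plain COUNT appears in the reduce plan, the combiner is not being used.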

You can check the "pig.alias" property in the jobconf to identify which
relations are being calculated by a given MR job; that might help narrow
things down.

On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[EMAIL PROTECTED]> wrote:

> hi,
> I wrote a pig script in which one of the reducers always OOMs, no matter
> how I change the parallelism.
>         Here's the script snippet:
> Data = group SourceData all;
> Result = foreach Data generate group, COUNT(SourceData);
> store Result into 'XX';
>   I analyzed the dumped Java heap and found that the reducer loads all
> the data for the foreach and the count.
> Can I re-implement BinSedesTuple to keep the reducers from loading all
> the data for computation?
> Here's the object dominator tree:
> here's the jmap result:
> Haitao Yao
> weibo: @haitao_yao
> Skype:  haitao.yao.final