Haitao Yao 2012-07-06, 06:44
BinSedesTuple is just the tuple, changing it won't do anything about the
fact that lots of tuples are being loaded.
The snippet you provided will not load all the data for computation, since
COUNT implements the Algebraic interface (partial counts will be done on
Something else is causing tuples to be materialized. Are you using other
UDFs? Can you provide more details on the script? When you run "explain" on
"Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?
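
To make the point about the Algebraic interface concrete, here is a minimal sketch of how a count can be split into stages so that no single task needs every tuple in memory. It is plain Python for illustration, not Pig's actual API: the function names initial, intermediate and final are hypothetical, loosely modeled on the COUNT$Intermediate and COUNT$Final names mentioned above, and the "splits" are an assumed stand-in for map-side input.

```python
from typing import Iterable, List

# Illustration only: these functions are hypothetical and are not Pig's API.


def initial(record: object) -> int:
    """Map side: each input record contributes a partial count of 1."""
    return 1


def intermediate(partials: Iterable[int]) -> int:
    """Combiner: merge partial counts into a single number."""
    return sum(partials)


def final(partials: Iterable[int]) -> int:
    """Reducer: merge the already-combined partial counts."""
    return sum(partials)


def algebraic_count(splits: List[List[object]]) -> int:
    # Each split is reduced to one integer before anything reaches the reducer.
    per_split = [intermediate(initial(r) for r in split) for split in splits]
    # The reducer only ever sees one small integer per split, never the records.
    return final(per_split)


splits = [["a", "b", "c"], ["d"], []]
total = algebraic_count(splits)
```

If the plan really is evaluated this way, the reducer holds a handful of integers rather than the grouped tuples, which is why a heap full of tuples suggests something else in the script is preventing this path.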
You can check the "pig.alias" property in the jobconf to identify which
relations are being calculated by a given MR job; that might help narrow
On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[EMAIL PROTECTED]> wrote:
> I wrote a Pig script in which one of the reducers always runs out of
> memory (OOM), no matter how I change the parallelism.
> Here's the script snippet:
> Data = group SourceData all;
> Result = foreach Data generate group, COUNT(SourceData);
> store Result into 'XX';
> I analyzed the dumped Java heap and found that the cause is that
> the reducer loads all the data for the foreach and count.
> Can I re-implement BinSedesTuple to keep reducers from loading all the data
> for computation?
> Here's the object dominator tree:
> Here's the jmap result:
> Haitao Yao
> [EMAIL PROTECTED]
> weibo: @haitao_yao
> Skype: haitao.yao.final
Haitao Yao 2012-07-09, 03:11
Haitao Yao 2012-07-09, 06:18
Haitao Yao 2012-07-09, 10:24
Dmitriy Ryaboy 2012-07-10, 14:35
Haitao Yao 2012-07-10, 14:49
Dmitriy Ryaboy 2012-07-10, 15:30