Pig >> mail # user >> pig reduce OOM


Haitao Yao 2012-07-06, 06:44
Dmitriy Ryaboy 2012-07-06, 15:06
Haitao Yao 2012-07-09, 03:11
Haitao Yao 2012-07-09, 06:18
Haitao Yao 2012-07-09, 10:24

Re: pig reduce OOM
Like I said earlier, if all you are doing is count, the data bag should not be growing. On the reduce side, it'll just be a bag of counts from each reducer. Something else is happening that's preventing the algebraic and accumulative optimizations from kicking in. Can you share a minimal script that reproduces the problem for you?
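
For reference, a minimal sketch of the two shapes in question (the load path and the second UDF below are hypothetical placeholders, not from this thread). When the foreach after a group projects only the group key and algebraic functions such as COUNT, Pig can compute partial counts in the combiners; consuming the bag with anything non-algebraic forces the whole bag to be materialized in the reducer:

    -- hypothetical input, only to make the sketch self-contained
    SourceData = load 'input' as (f1, f2);

    -- combiner-friendly: only the group key and an algebraic COUNT are generated
    Data   = group SourceData all;
    Result = foreach Data generate group, COUNT(SourceData);

    -- defeats the combiner: the bag is also consumed by a non-algebraic UDF
    -- (MyNonAlgebraicUdf is a hypothetical placeholder)
    Result2 = foreach Data generate group, COUNT(SourceData), MyNonAlgebraicUdf(SourceData);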

On Jul 9, 2012, at 3:24 AM, Haitao Yao <[EMAIL PROTECTED]> wrote:

> Seems like a big data bag is still a headache for Pig.
> Here's a mail archive thread I found: http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%[EMAIL PROTECTED]%3E
>
> I've tried all the ways I can think of, and none of them works.
> I think I have to play some tricks inside the Pig source code.
>
>
>
> Haitao Yao
> [EMAIL PROTECTED]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
>
> On 2012-7-9, at 2:18 PM, Haitao Yao wrote:
>
>> There's also another reason for the OOM: I group the data by all, so the parallelism is 1. With a big data bag, the reducer OOMs.
>>
>> After digging into the Pig source code, I found that replacing the data bag in BinSedesTuple is quite tricky and may cause other unknown problems…
>>
>> Has anybody else encountered the same problem?
>>
>>
>> Haitao Yao
>> [EMAIL PROTECTED]
>> weibo: @haitao_yao
>> Skype:  haitao.yao.final
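
If the combiner optimization cannot kick in and the single reducer produced by group ... all becomes the bottleneck, one common pattern is to pre-aggregate per key before the final step. A sketch only; the column key and the output path are hypothetical:

    -- spread the partial counts over many reducers first
    Grouped   = group SourceData by key parallel 10;
    Partials  = foreach Grouped generate COUNT(SourceData) as cnt;

    -- then combine the small partial counts in the single final reducer
    AllCounts = group Partials all;
    Total     = foreach AllCounts generate SUM(Partials.cnt);
    store Total into 'total_count';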
>>
>> On 2012-7-9, at 11:11 AM, Haitao Yao wrote:
>>
>>> Sorry for the imprecise statement.
>>> The problem is the DataBag. BinSedesTuple reads the full data of the DataBag, and when COUNT is used on the data, it causes an OOM.
>>> The diagrams also show that most of the objects come from the ArrayList.
>>>
>>> I want to reimplement the DataBag that is read by BinSedesTuple so that it only holds a reference to the data input and reads the data one by one when it is accessed through an iterator.
>>>
>>> I will give it a shot.
>>>
>>> Haitao Yao
>>> [EMAIL PROTECTED]
>>> weibo: @haitao_yao
>>> Skype:  haitao.yao.final
>>>
>>> On 2012-7-6, at 11:06 PM, Dmitriy Ryaboy wrote:
>>>
>>>> BinSedesTuple is just the tuple; changing it won't do anything about the fact that lots of tuples are being loaded.
>>>>
>>>> The snippet you provided will not load all the data for computation, since COUNT implements the Algebraic interface (partial counts will be done in the combiners).
>>>>
>>>> Something else is causing tuples to be materialized. Are you using other UDFs? Can you provide more details on the script? When you run "explain" on "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?
>>>>
>>>> You can check the "pig.alias" property in the jobconf to identify which relations are being calculated by a given MR job; that might help narrow things down.
>>>>
>>>> -Dmitriy
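
A rough sketch of that check from the Grunt shell, using the Result alias from the snippet quoted below; the class names are the ones mentioned above:

    grunt> explain Result;
    -- in the MapReduce plan, look for COUNT$Initial in the Map Plan,
    -- COUNT$Intermediate in the Combine Plan, and COUNT$Final in the Reduce Plan;
    -- if the reduce plan shows a plain COUNT over the full bag instead, the
    -- algebraic optimization has not been applied

    -- at runtime, the "pig.alias" property in each MR job's configuration names
    -- the aliases that job is computing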
>>>>
>>>>
>>>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[EMAIL PROTECTED]> wrote:
>>>> hi,
>>>>    I wrote a Pig script in which one of the reducers always OOMs, no matter how I change the parallelism.
>>>>        Here's the script snippet:
>>>>        Data = group SourceData all;
>>>>        Result = foreach Data generate group, COUNT(SourceData);
>>>>        store Result into 'XX';
>>>>    
>>>>    I analyzed the dumped Java heap and found out that the reason is that the reducer loads all the data for the foreach and the count.
>>>>
>>>>    Can I re-implement BinSedesTuple to avoid having the reducers load all the data for the computation?
>>>>
>>>> Here's the object domination tree:
>>>>
>>>>
>>>>
>>>> Here's the jmap result:
>>>>
>>>>
>>>>
>>>> Haitao Yao
>>>> [EMAIL PROTECTED]
>>>> weibo: @haitao_yao
>>>> Skype:  haitao.yao.final
>>>>
>>>>
>>>
>>
>
Haitao Yao 2012-07-10, 14:49
Dmitriy Ryaboy 2012-07-10, 15:30