Pig, mail # user - pig reduce OOM


Re: pig reduce OOM
Dmitriy Ryaboy 2012-07-10, 15:30
Saw that right after I sent the reply -- yeah, Pig assumes larger
heaps than what you are running.
It would be a nice project for someone to document all the various
Hadoop and Pig buffers that get allocated, and the parameters that
control them, to see where memory goes in mappers and reducers.

D

On Tue, Jul 10, 2012 at 7:49 AM, Haitao Yao <[EMAIL PROTECTED]> wrote:
> I've found the reason: it's InternalCachedBag.
> I've posted all the details in a mail titled: What is the best way to do counting in pig?
> I'm afraid I can't give you a link to that mail, since the Apache mailing list archiver hasn't caught up with the message yet.
>
>
> Haitao Yao
> [EMAIL PROTECTED]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
>
> On Jul 10, 2012, at 10:35 PM, Dmitriy Ryaboy wrote:
>
>> Like I said earlier, if all you are doing is count, the data bag should not be growing. On the reduce side, it'll just be a bag of counts from each reducer. Something else is happening that's preventing the algebraic and accumulative optimizations from kicking in. Can you share a minimal script that reproduces the problem for you?
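Dmitriy's point about the algebraic optimization can be illustrated with a toy sketch (plain Python, hypothetical function names; Pig's real implementation is the Algebraic interface in org.apache.pig.builtin.COUNT): partial counts are produced map-side and folded by combiners, so the reduce-side bag holds one small number per map task rather than the raw tuples.

```python
# Toy model of algebraic COUNT (hypothetical names, not Pig's classes).

def count_initial(tuples):
    # Map side: emit a partial count instead of the raw tuples.
    return sum(1 for _ in tuples)

def count_intermed(partials):
    # Combiner: fold partial counts together.
    return sum(partials)

def count_final(partials):
    # Reduce side: the "bag" is just one small count per upstream task.
    return sum(partials)

# Three map tasks over 3 million rows in total:
map_outputs = [count_initial(range(1_000_000)) for _ in range(3)]
combined = [count_intermed(map_outputs)]
total = count_final(combined)
print(total)  # 3000000 -- the reducer never sees the 3M raw tuples
```

If something in the script prevents this rewrite (as Dmitriy suspects here), the reducer instead receives the full bag of tuples, which matches the OOM being discussed.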
>>
>> On Jul 9, 2012, at 3:24 AM, Haitao Yao <[EMAIL PROTECTED]> wrote:
>>
>>> Seems like a big data bag is still a headache for Pig.
>>> Here's a mail archive thread I found: http://mail-archives.apache.org/mod_mbox/pig-user/200806.mbox/%[EMAIL PROTECTED]%3E
>>>
>>> I've tried all the ways I can think of, and none of them works.
>>> I think I have to play some tricks inside the Pig source code.
>>>
>>>
>>>
>>> Haitao Yao
>>> [EMAIL PROTECTED]
>>> weibo: @haitao_yao
>>> Skype:  haitao.yao.final
>>>
>>> On Jul 9, 2012, at 2:18 PM, Haitao Yao wrote:
>>>
>>>> There's also another reason for the OOM: I group the data by ALL, so the parallelism is 1. With a big data bag, the reducer OOMs.
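GROUP ... ALL necessarily funnels everything to a single reducer. A common workaround, sketched below as a toy Python model (in Pig this would be a GROUP BY on a synthetic key followed by a second pass over the per-key results; the names here are illustrative only), is to aggregate in two stages so no single task holds the whole bag:

```python
# Toy two-stage aggregation (hypothetical sketch, not Pig code):
# stage 1 fans records out across many partitions keyed by a hash,
# stage 2 combines the small per-partition results.

def stage1_partial_counts(records, partitions=16):
    buckets = [0] * partitions
    for r in records:
        buckets[hash(r) % partitions] += 1  # each partition counts its share
    return buckets  # one small number per partition

def stage2_total(buckets):
    return sum(buckets)  # trivial amount of data for the final step

print(stage2_total(stage1_partial_counts(range(100_000))))  # 100000
```

When the combiner fires (as in the algebraic COUNT case), Pig effectively gets this two-stage behavior for free; the single reducer only hurts when the raw bag reaches it.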
>>>>
>>>> After digging into the Pig source code, I found that replacing the data bag in BinSedesTuple is quite tricky, and it may cause other unknown problems…
>>>>
>>>> Has anybody else encountered the same problem?
>>>>
>>>>
>>>> Haitao Yao
>>>> [EMAIL PROTECTED]
>>>> weibo: @haitao_yao
>>>> Skype:  haitao.yao.final
>>>>
>>>> On Jul 9, 2012, at 11:11 AM, Haitao Yao wrote:
>>>>
>>>>> Sorry for the imprecise statement.
>>>>> The problem is the DataBag: BinSedesTuple reads the full contents of the DataBag, and when COUNT is applied to the data, it causes an OOM.
>>>>> The heap diagrams also show that most of the objects come from the ArrayList.
>>>>>
>>>>> I want to reimplement the DataBag read by BinSedesTuple so that it just holds a reference to the data input and reads the tuples one by one when an iterator is used to access the data.
>>>>>
>>>>> I will give it a shot.
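The lazy-bag idea above can be sketched in Python (a toy illustration, hypothetical names; Pig's real spill-aware bags live in org.apache.pig.data, e.g. InternalCachedBag): instead of materializing every tuple into an ArrayList-like structure at deserialization time, the bag keeps a handle to the input and yields tuples on demand from its iterator.

```python
# Toy sketch of a lazy bag (hypothetical, not Pig's DataBag API).

class LazyBag:
    def __init__(self, read_one, size):
        self._read_one = read_one  # callable that deserializes one tuple
        self._size = size

    def __iter__(self):
        # Tuples are produced one at a time; none are retained in memory.
        for i in range(self._size):
            yield self._read_one(i)

def count(bag):
    return sum(1 for _ in bag)

bag = LazyBag(read_one=lambda i: ("row", i), size=1_000_000)
print(count(bag))  # 1000000, without ever holding 1M tuples at once
```

The catch, as Haitao notes, is that a bag backed by the input stream is only valid while the stream is; re-iteration or random access would require re-reading or spilling, which is part of why swapping this into BinSedesTuple is tricky.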
>>>>>
>>>>> Haitao Yao
>>>>> [EMAIL PROTECTED]
>>>>> weibo: @haitao_yao
>>>>> Skype:  haitao.yao.final
>>>>>
>>>>> On Jul 6, 2012, at 11:06 PM, Dmitriy Ryaboy wrote:
>>>>>
>>>>>> BinSedesTuple is just the tuple; changing it won't do anything about the fact that lots of tuples are being loaded.
>>>>>>
>>>>>> The snippet you provided will not load all the data for computation, since COUNT implements algebraic interface (partial counts will be done on combiners).
>>>>>>
>>>>>> Something else is causing tuples to be materialized. Are you using other UDFs? Can you provide more details on the script? When you run "explain" on "Result", do you see Pig using COUNT$Final, COUNT$Intermediate, etc?
>>>>>>
>>>>>> You can check the "pig.alias" property in the jobconf to identify which relations are being calculated by a given MR job; that might help narrow things down.
>>>>>>
>>>>>> -Dmitriy
>>>>>>
>>>>>>
>>>>>> On Thu, Jul 5, 2012 at 11:44 PM, Haitao Yao <[EMAIL PROTECTED]> wrote:
>>>>>> hi,
>>>>>>   I wrote a Pig script in which one of the reducers always OOMs, no matter how I change the parallelism.
>>>>>>       Here's the script snippet:
>>>>>>       Data = group SourceData all;
>>>>>>       Result = foreach Data generate group, COUNT(SourceData);
>>>>>>       store Result into 'XX';
>>>>>>
>>>>>>   I analyzed the dumped Java heap, and found that the reducer loads all the data for the foreach and count.