Pig, mail # user - Re: What is the best way to do counting in pig?


Haitao Yao 2012-08-09, 14:51
Hey all, I've submitted the patch for PIG-2812. Here's the link: https://issues.apache.org/jira/browse/PIG-2812

I didn't change the data bags to spill into only one file, since that would be a very big modification. Instead, DefaultAbstractDataBag now spills into a single directory, and a shutdown hook deletes that directory recursively.
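
For anyone following along, the idea can be sketched roughly as below. This is not the code from the patch: the class name SpillDirectory, its method names, and the file prefixes are all hypothetical, and only standard JDK calls are used. It shows one shared spill directory with a single shutdown hook, instead of one deleteOnExit registration per spill file.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Comparator;
import java.util.stream.Stream;

// Hypothetical helper for illustration only; not Pig's actual implementation.
final class SpillDirectory {
    private SpillDirectory() {}

    // Create one temp directory and register a single shutdown hook for it.
    static Path create() throws IOException {
        Path dir = Files.createTempDirectory("pigbag-spill-");
        Runtime.getRuntime().addShutdownHook(new Thread(() -> deleteRecursively(dir)));
        return dir;
    }

    // Each spill gets its own file inside the shared directory.
    // Note that deleteOnExit() is never called on the individual files.
    static Path newSpillFile(Path dir) throws IOException {
        return Files.createTempFile(dir, "bag-", ".spill");
    }

    // Walk the tree and delete in reverse path order, so children go before parents.
    static void deleteRecursively(Path dir) {
        if (!Files.exists(dir)) {
            return;
        }
        try (Stream<Path> paths = Files.walk(dir)) {
            paths.sorted(Comparator.reverseOrder()).forEach(p -> {
                try {
                    Files.deleteIfExists(p);
                } catch (IOException e) {
                    throw new UncheckedIOException(e);
                }
            });
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        Path dir = create();
        for (int i = 0; i < 3; i++) {
            newSpillFile(dir);
        }
        try (Stream<Path> files = Files.list(dir)) {
            System.out.println("spill files: " + files.count());
        }
        // The directory and its files are removed by the shutdown hook at JVM exit.
    }
}
```

With this shape, the per-JVM cleanup bookkeeping stays at one hook no matter how many times a bag spills.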
Haitao Yao
[EMAIL PROTECTED]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-12, at 11:04 AM, Haitao Yao wrote:

> Sorry, here's the full mail.
>
>
> > Is your query using combiner ?
> I didn't know how to explicitly use the combiner.
>
> > Can you send the explain plan output ?
> The explain result is in the attachment; it's a little long. Link: http://pastebin.com/Q6CvKiP1
>
> <aa.explain>
>
>
> > Does the heap information say how many entries are there in the
> InternalCachedBag's ArrayList ?
> There are 6 big ArrayLists, and the size is about 372692.
> Here are the screenshots of the heap dump:
>
> Screenshot 1: you can see there are 6 big POForEach instances.
>
> <aa.jpg>
>
>
> Screenshot 2: you can see the memory is mostly retained by the big ArrayList.
>
> <bb.jpg>
>
>
> Screenshot 3: you can see the big ArrayList is referenced by InternalCachedBag:
>
> <cc.jpg>
>
>
> > What version of pig are you using?
> pig-0.9.2. I've read the latest Pig source code on GitHub, and I didn't find any improvements to InternalCachedBag.
>
>
>
> On 2012-7-12, at 10:58 AM, Jonathan Coveney wrote:
>
>> the listserv strips attachments. you'll have to host it somewhere else and
>> link it
>>
>> 2012/7/11 Haitao Yao <[EMAIL PROTECTED]>
>>
>>> Sorry, I sent the mail only to Thejas.
>>>
>>> Resend it for all.
>>>
>>>
>>> Haitao Yao
>>> [EMAIL PROTECTED]
>>> weibo: @haitao_yao
>>> Skype:  haitao.yao.final
>>>
>>> On 2012-7-12, at 10:41 AM, Haitao Yao wrote:
>>>
>>>>
>>>>
>>>>> Is your query using combiner ?
>>>>      I didn't know how to explicitly use the combiner.
>>>>
>>>>> Can you send the explain plan output ?
>>>>      The explain result is in the attachment. It's a little long.
>>>>
>>>> <aa.explain>
>>>>
>>>>> Does the heap information say how many entries are there in the
>>>> InternalCachedBag's ArrayList ?
>>>>      There are 6 big ArrayLists, and the size is about 372692.
>>>>      Here are the screenshots of the heap dump:
>>>>
>>>>      Screenshot 1: you can see there are 6 big POForEach instances.
>>>>
>>>> <aa.jpg>
>>>>
>>>>              Screenshot 2: you can see the memory is mostly
>>> retained by the big ArrayList.
>>>>
>>>> <bb.jpg>
>>>>
>>>>              Screenshot 3: you can see the big ArrayList is
>>> referenced by InternalCachedBag:
>>>>
>>>> <cc.jpg>
>>>>
>>>>> What version of pig are you using?
>>>>      pig-0.9.2. I've read the latest Pig source code on GitHub,
>>> and I didn't find any improvements to InternalCachedBag.
>>>>
>>>>
>>>> Haitao Yao
>>>> [EMAIL PROTECTED]
>>>> weibo: @haitao_yao
>>>> Skype:  haitao.yao.final
>>>>
>>>> On 2012-7-12, at 8:56 AM, Thejas Nair wrote:
>>>>
>>>>> Haitao,
>>>>> Is your query using combiner ? Can you send the explain plan output ?
>>>>> Does the heap information say how many entries are there in the
>>>>> InternalCachedBag's ArrayList ?
>>>>> What version of pig are you using ?
>>>>>
>>>>>
>>>>> Thanks,
>>>>> Thejas
>>>>>
>>>>>
>>>>> On 7/10/12 11:50 PM, Haitao Yao wrote:
>>>>>> Oh, new discovery: we cannot set pig.cachedbag.memusage = 0, because
>>>>>> every time the InternalCachedBag spills, it creates a new tmp file in
>>>>>> java.io.tmpdir. If we set pig.cachedbag.memusage to 0, every new tuple
>>>>>> added to the InternalCachedBag creates a new tmp file, and each tmp
>>>>>> file is only deleted on exit.
>>>>>> So, if you're unlucky like me, you will get an OOM exception caused by
>>>>>> java.io.DeleteOnExitHook!
>>>>>> Here's the evidence:
>>>>>>
>>>>>> God, we really need a full description of how every parameter works.
>>>>>>
>>>>>>
>>>>>>
>>>>>> Haitao Yao
>>>>>> [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
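
The DeleteOnExitHook problem described in the oldest message above can be illustrated with a small sketch. It is only an illustration, not Pig code: the class and method names are hypothetical, and the "pigbag" file prefix is an assumption. The first method mirrors the pattern the message describes, where every spill creates a temp file registered with File.deleteOnExit(); the JDK keeps each registered path in memory until the JVM exits, so that bookkeeping grows with the number of spills. The second method shows one alternative, deleting each file as soon as it is no longer needed.

```java
import java.io.File;
import java.io.IOException;

// Illustrative only: names are hypothetical and this is not Pig's implementation.
final class SpillCleanupDemo {
    private SpillCleanupDemo() {}

    // One temp file per spill, each registered with deleteOnExit().
    // Every registered path is retained by the JDK until the JVM shuts down.
    static int spillWithDeleteOnExit(int spills) throws IOException {
        int created = 0;
        for (int i = 0; i < spills; i++) {
            File f = File.createTempFile("pigbag", ".tmp");
            f.deleteOnExit();
            created++;
        }
        return created;
    }

    // Alternative: delete each file eagerly once it is no longer needed,
    // so nothing accumulates in the JVM's delete-on-exit bookkeeping.
    static int spillWithEagerDelete(int spills) throws IOException {
        int deleted = 0;
        for (int i = 0; i < spills; i++) {
            File f = File.createTempFile("pigbag", ".tmp");
            if (f.delete()) {
                deleted++;
            }
        }
        return deleted;
    }
}
```

With pig.cachedbag.memusage set to 0, the number of spills is roughly the number of tuples added, which is why the first pattern can exhaust the heap on large bags.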