Pig, mail # user - What is the best way to do counting in pig?


Re: What is the best way to do counting in pig?
Haitao Yao 2012-07-12, 02:56
Sorry, I sent the mail only to Thejas.

Resending it to the list.
Haitao Yao
[EMAIL PROTECTED]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-12, at 10:41 AM, Haitao Yao wrote:

>
>
> > Is your query using combiner ?
> I don't know how to explicitly use the combiner.
>
> > Can you send the explain plan output ?
> The explain result is in the attachment. It's a little long.
>
> <aa.explain>
>
> > Does the heap information say how many entries are there in the
> InternalCachedBag's ArrayList ?
> There are 6 big ArrayLists, each with about 372,692 entries.
> Here's the screen snapshot of the heap dump:
>
> screen snapshot 1: you can see there are 6 big POForEach instances
>
> <aa.jpg>
>
> screen snapshot 2: you can see the memory is mostly retained by the big ArrayList.
>
> <bb.jpg>
>
> screen snapshot 3: you can see the big ArrayList is referenced by InternalCachedBag:
>
> <cc.jpg>
>
> > What version of pig are you using?
> Pig 0.9.2. I've read the latest Pig source code from GitHub, and I don't see any improvements to InternalCachedBag.
>
>
> Haitao Yao
> [EMAIL PROTECTED]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
>
> On 2012-7-12, at 8:56 AM, Thejas Nair wrote:
>
>> Haitao,
>> Is your query using combiner ? Can you send the explain plan output ?
>> Does the heap information say how many entries are there in the
>> InternalCachedBag's ArrayList ?
>> What version of pig are you using ?
>>
>>
>> Thanks,
>> Thejas
>>
>>
>> On 7/10/12 11:50 PM, Haitao Yao wrote:
>>> Oh, new discovery: we cannot set pig.cachedbag.memusage = 0, because
>>> every time the InternalCachedBag spills, it creates a new tmp file in
>>> java.io.tmpdir. If we set pig.cachedbag.memusage to 0, every new tuple
>>> added into an InternalCachedBag creates a new tmp file, and each tmp
>>> file is only deleted on exit.
>>> So, if you're unlucky like me, you will get an OOM exception caused by
>>> java.io.DeleteOnExitHook!
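[Editor's note: the failure mode described above can be sketched in plain Java. This is a hedged illustration, not Pig's code; the class and file names are hypothetical. `File.deleteOnExit()` registers each path in the JVM-global `java.io.DeleteOnExitHook` set, which is only drained at shutdown, so one spill file per tuple grows that set without bound.]

```java
import java.io.File;
import java.io.IOException;

public class DeleteOnExitGrowth {
    public static void main(String[] args) throws IOException {
        // Stand-in for 100 spilled tuples, each producing its own spill file.
        for (int i = 0; i < 100; i++) {
            File tmp = File.createTempFile("pigbag", ".tmp");
            tmp.deleteOnExit(); // path string retained by DeleteOnExitHook until JVM exit
            // Even deleting the file now would not shrink the hook's set,
            // so with millions of tuples the set alone can exhaust the heap.
        }
        System.out.println("registered 100 deleteOnExit entries");
    }
}
```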
>>> Here's the evidence:
>>>
>>> God, we really need a full description of how every parameter works.
>>>
>>>
>>>
>>> Haitao Yao
>>> [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>>> weibo: @haitao_yao
>>> Skype:  haitao.yao.final
>>>
>>> On 2012-7-10, at 4:20 PM, Haitao Yao wrote:
>>>
>>>> I found the solution.
>>>>
>>>> After analyzing the heap dump taken when the reducer hit the OOM, I found
>>>> that the memory is consumed by org.apache.pig.data.InternalCachedBag. Here's
>>>> the diagram:
>>>> <cc.jpg>
>>>>
>>>> In the source code of org.apache.pig.data.InternalCachedBag, I found
>>>> out there's a parameter for the cache limit:
>>>> public InternalCachedBag(int bagCount) {
>>>>     float percent = 0.2F;
>>>>
>>>>     if (PigMapReduce.sJobConfInternal.get() != null) {
>>>>         // here, the cache limit is from here!
>>>>         String usage =
>>>>             PigMapReduce.sJobConfInternal.get().get("pig.cachedbag.memusage");
>>>>         if (usage != null) {
>>>>             percent = Float.parseFloat(usage);
>>>>         }
>>>>     }
>>>>
>>>>     init(bagCount, percent);
>>>> }
>>>>
>>>> private void init(int bagCount, float percent) {
>>>>     factory = TupleFactory.getInstance();
>>>>     mContents = new ArrayList<Tuple>();
>>>>
>>>>     long max = Runtime.getRuntime().maxMemory();
>>>>     maxMemUsage = (long)(((float)max * percent) / (float)bagCount);
>>>>     cacheLimit = Integer.MAX_VALUE;
>>>>
>>>>     // set limit to 0, if memusage is 0 or really really small.
>>>>     // then all tuples are put into disk
>>>>     if (maxMemUsage < 1) {
>>>>         cacheLimit = 0;
>>>>     }
>>>>     log.warn("cacheLimit: " + this.cacheLimit);
>>>>     addDone = false;
>>>> }
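[Editor's note: to see what that budget formula does in practice, here is a small worked example. The heap size and bag count are illustrative assumptions, not values from the thread.]

```java
public class BagBudget {
    public static void main(String[] args) {
        long max = 1024L * 1024 * 1024;  // pretend Runtime.getRuntime().maxMemory() is 1 GiB
        float percent = 0.2f;            // the default when pig.cachedbag.memusage is unset
        int bagCount = 6;                // e.g. the 6 bags seen in the heap dump
        long maxMemUsage = (long) (((float) max * percent) / (float) bagCount);
        System.out.println(maxMemUsage); // roughly 35 MB per bag before spilling

        // percent = 0 drives maxMemUsage below 1, so cacheLimit becomes 0
        // and every single tuple goes straight to its own spill file.
        long zeroBudget = (long) (((float) max * 0.0f) / (float) bagCount);
        assert zeroBudget < 1;
    }
}
```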
>>>>
>>>> So, after writing pig.cachedbag.memusage=0 into
>>>> $PIG_HOME/conf/pig.properties, my job succeeds!
>>>>
>>>> You can also set it to an appropriate value to fully utilize your
>>>> memory as a cache.
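[Editor's note: as an alternative to editing pig.properties globally, the same property can be set per script with Pig's `set` command or on the command line. The value 0.1 below is only an illustrative fraction of the heap, not a recommendation.]

```
-- at the top of the Pig script itself
set pig.cachedbag.memusage 0.1;

-- or on the command line:
--   pig -Dpig.cachedbag.memusage=0.1 myscript.pig
```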
>>>>
>>>> Hope this is useful for others.
>>>> Thanks.
>>>>
>>>>
>>>> Haitao Yao
>>>> [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>>>> weibo: @haitao_yao
>>>> Skype:  haitao.yao.final
>>>>