Pig user mailing list: What is the best way to do counting in pig?


Re: What is the best way to do counting in pig?
Sorry, I sent the mail only to Thejas.

Resending it to the whole list.
Haitao Yao
[EMAIL PROTECTED]
weibo: @haitao_yao
Skype:  haitao.yao.final

On 2012-7-12, at 10:41 AM, Haitao Yao wrote:

>
>
> > Is your query using combiner ?
> I don't know how to explicitly use the combiner.
>
> > Can you send the explain plan output ?
> The explain result is in the attachment. It's a little long.
>
> <aa.explain>
>
> > Does the heap information say how many entries are there in the
> > InternalCachedBag's ArrayList?
> There are 6 big ArrayLists, and the size is about 372692
> Here's the screen snapshot of the heap dump:
>
> screen snapshot 1: you can see there are 6 big POForEach instances
>
> <aa.jpg>
>
> screen snapshot 2: you can see the memory is mostly retained by the big ArrayList.
>
> <bb.jpg>
>
> screen snapshot 3: you can see the big ArrayList is referenced by InternalCachedBag:
>
> <cc.jpg>
>
> > What version of pig are you using?
> pig-0.9.2. I've read the latest source code of Pig from GitHub, and I didn't find any improvements to InternalCachedBag.
>
>
> Haitao Yao
> [EMAIL PROTECTED]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
>
> On 2012-7-12, at 8:56 AM, Thejas Nair wrote:
>
>> Haitao,
>> Is your query using combiner ? Can you send the explain plan output ?
>> Does the heap information say how many entries are there in the
>> InternalCachedBag's ArrayList?
>> What version of pig are you using ?
>>
>>
>> Thanks,
>> Thejas
>>
>>
>> On 7/10/12 11:50 PM, Haitao Yao wrote:
>>> Oh, new discovery: we cannot set pig.cachedbag.memusage = 0, because
>>> every time the InternalCachedBag spills, it creates a new tmp file in
>>> java.io.tmpdir. If we set pig.cachedbag.memusage to 0, every new tuple
>>> added into InternalCachedBag will create a new tmp file, and the tmp
>>> file is only deleted on exit.
>>> So, if you're unlucky like me, you will get an OOM exception caused by
>>> java.io.DeleteOnExitHook!
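(Illustration, not part of the original mail: in the OpenJDK classes, File.deleteOnExit() adds the file's path to a static set inside java.io.DeleteOnExitHook that is only drained at JVM shutdown, so a reducer that spills one tmp file per tuple retains one entry per spill. A minimal sketch of that growth; the file prefix and loop count are made up.)

    import java.io.File;
    import java.io.IOException;

    public class DeleteOnExitGrowth {
        public static void main(String[] args) throws IOException {
            // Each iteration mimics one InternalCachedBag spill when
            // pig.cachedbag.memusage=0: a new tmp file in java.io.tmpdir,
            // registered for deletion on exit.
            for (int i = 0; i < 1000000; i++) {
                File tmp = File.createTempFile("pigspill", null);
                tmp.deleteOnExit(); // path retained by DeleteOnExitHook until shutdown
                tmp.delete();       // deleting the file does not release the hook entry
            }
            // With a small enough heap, the growing set of retained paths
            // eventually triggers an OutOfMemoryError, as described above.
        }
    }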
>>> Here's the evidence:
>>>
>>> God, we really need a full description of how every parameter works.
>>>
>>>
>>>
>>> Haitao Yao
>>> [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>>> weibo: @haitao_yao
>>> Skype:  haitao.yao.final
>>>
>>> On 2012-7-10, at 4:20 PM, Haitao Yao wrote:
>>>
>>>> I found the solution.
>>>>
>>>> After analyzing the heap dump taken when the reducer hit OOM, I found that the
>>>> memory is consumed by org.apache.pig.data.InternalCachedBag. Here's
>>>> the diagram:
>>>> <cc.jpg>
>>>>
>>>> In the source code of org.apache.pig.data.InternalCachedBag, I found
>>>> out there's a parameter for the cache limit:
>>>>     public InternalCachedBag(int bagCount) {
>>>>         float percent = 0.2F;
>>>>
>>>>         if (PigMapReduce.sJobConfInternal.get() != null) {
>>>>             // here, the cache limit is from here!
>>>>             String usage = PigMapReduce.sJobConfInternal.get()
>>>>                     .get("pig.cachedbag.memusage");
>>>>             if (usage != null) {
>>>>                 percent = Float.parseFloat(usage);
>>>>             }
>>>>         }
>>>>
>>>>         init(bagCount, percent);
>>>>     }
>>>>
>>>>     private void init(int bagCount, float percent) {
>>>>         factory = TupleFactory.getInstance();
>>>>         mContents = new ArrayList<Tuple>();
>>>>
>>>>         long max = Runtime.getRuntime().maxMemory();
>>>>         maxMemUsage = (long)(((float)max * percent) / (float)bagCount);
>>>>         cacheLimit = Integer.MAX_VALUE;
>>>>
>>>>         // set limit to 0, if memusage is 0 or really really small.
>>>>         // then all tuples are put into disk
>>>>         if (maxMemUsage < 1) {
>>>>             cacheLimit = 0;
>>>>         }
>>>>         log.warn("cacheLimit: " + this.cacheLimit);
>>>>         addDone = false;
>>>>     }
>>>>
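(Illustration, not part of the original mail: plugging hypothetical numbers into the init() arithmetic above, assuming a 1 GB reducer heap, the default percent of 0.2, and 6 bags as in the heap dump.)

    public class BagMemUsageMath {
        public static void main(String[] args) {
            // Hypothetical values plugged into InternalCachedBag.init() above.
            long max = 1024L * 1024 * 1024;  // assume Runtime.getRuntime().maxMemory() == 1 GB
            float percent = 0.2F;            // default pig.cachedbag.memusage
            int bagCount = 6;                // assume 6 bags, matching the heap dump above
            long maxMemUsage = (long) (((float) max * percent) / (float) bagCount);
            System.out.println(maxMemUsage); // about 35791394 bytes, roughly 34 MB per bag
            // With pig.cachedbag.memusage=0: percent = 0, so maxMemUsage = 0 (< 1),
            // cacheLimit becomes 0, and every tuple is written straight to disk.
        }
    }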
>>>> So, after writing pig.cachedbag.memusage=0 into
>>>> $PIG_HOME/conf/pig.properties, my job succeeds!
>>>>
>>>> You can also set it to an appropriate value to fully utilize your memory
>>>> as a cache.
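(Illustration, not part of the original mail: besides editing pig.properties, the property can also be supplied per job. A minimal sketch using the PigServer API, assuming Pig 0.9.x; the 0.1 value and the script name are placeholders.)

    import java.util.Properties;
    import org.apache.pig.ExecType;
    import org.apache.pig.PigServer;

    public class RunWithBagMemUsage {
        public static void main(String[] args) throws Exception {
            Properties props = new Properties();
            // Placeholder value: let each bag use up to 10% of the heap before spilling.
            props.setProperty("pig.cachedbag.memusage", "0.1");
            PigServer pig = new PigServer(ExecType.MAPREDUCE, props);
            pig.registerScript("count_job.pig"); // hypothetical script name
        }
    }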
>>>>
>>>> Hope this is useful for others.
>>>> Thanks.
>>>>
>>>>
>>>> Haitao Yao
>>>> [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
>>>> weibo: @haitao_yao
>>>> Skype:  haitao.yao.final
>>>>