Pig, mail # user - What is the best way to do counting in pig?


Re: What is the best way to do counting in pig?
Jonathan Coveney 2012-07-12, 02:58
The listserv strips attachments. You'll have to host it somewhere else and
link to it.

2012/7/11 Haitao Yao <[EMAIL PROTECTED]>

> Sorry, I sent the mail only to Thejas.
>
> Resending it to everyone.
>
>
> Haitao Yao
> [EMAIL PROTECTED]
> weibo: @haitao_yao
> Skype:  haitao.yao.final
>
> On 2012-7-12, at 10:41 AM, Haitao Yao wrote:
>
> >
> >
> > > Is your query using combiner ?
> >       I don't know how to explicitly use a combiner.
> >
> > > Can you send the explain plan output ?
> >       The explain result is in the attachment. It's a little long.
> >
> > <aa.explain>
> >
> > > Does the heap information say how many entries are there in the
> > > InternalCachedBag's ArrayList ?
> >       There are 6 big ArrayLists, and the size is about 372692.
> >       Here's the screen snapshot of the heap dump:
> >
> >       screen snapshot 1: you can see there are 6 big POForEach instances
> >
> > <aa.jpg>
> >
> >               screen snapshot 2: you can see the memory is mostly
> > retained by the big array list.
> >
> > <bb.jpg>
> >
> >               screen snapshot 3: you can see the big array list is
> > referenced by InternalCachedBag:
> >
> > <cc.jpg>
> >
> > > What version of pig are you using?
> >       pig-0.9.2. I've read the latest source code of pig from github,
> > and I didn't find any improvements to InternalCachedBag.
> >
> >
> > Haitao Yao
> > [EMAIL PROTECTED]
> > weibo: @haitao_yao
> > Skype:  haitao.yao.final
> >
> > On 2012-7-12, at 8:56 AM, Thejas Nair wrote:
> >
> >> Haitao,
> >> Is your query using combiner ? Can you send the explain plan output ?
> >> Does the heap information say how many entries are there in the
> >> InternalCachedBag's ArrayList ?
> >> What version of pig are you using ?
> >>
> >>
> >> Thanks,
> >> Thejas
> >>
> >>
> >> On 7/10/12 11:50 PM, Haitao Yao wrote:
> >>> Oh, new discovery: we cannot set pig.cachedbag.memusage = 0, because
> >>> every time the InternalCachedBag spills, it creates a new tmp file in
> >>> java.io.tmpdir. If we set pig.cachedbag.memusage to 0, every new tuple
> >>> added to the InternalCachedBag will create a new tmp file, and each tmp
> >>> file is only deleted on exit.
> >>> So, if you're unlucky like me, you will get an OOM exception caused by
> >>> java.io.DeleteOnExitHook!
> >>> Here's the evidence:
> >>>
> >>> God, we really need a full description of how every parameter works.
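
The failure mode described above follows from how File.deleteOnExit() works: each call registers the path in a JVM-wide set that is only drained at shutdown, and there is no API to unregister an entry, so one temp file per tuple means one retained path string per tuple. The following is a minimal sketch of an alternative pattern, not Pig's actual code: it assumes the spill files can be deleted explicitly once they are no longer needed. The class name SpillFileSketch, the method spillAndCleanUp, and the "spill-sketch" file prefix are all hypothetical.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.List;

public class SpillFileSketch {

    /**
     * Creates one temp file per simulated spill WITHOUT calling deleteOnExit(),
     * deletes every file explicitly, and returns how many files still exist.
     * Hypothetical helper for illustration; not part of Pig.
     */
    public static int spillAndCleanUp(int spills) {
        List<Path> spillFiles = new ArrayList<>();
        try {
            try {
                for (int i = 0; i < spills; i++) {
                    // Files.createTempFile uses java.io.tmpdir by default.
                    Path p = Files.createTempFile("spill-sketch", ".tmp");
                    Files.write(p, new byte[] {1, 2, 3}); // stand-in for spilled tuples
                    spillFiles.add(p);
                }
            } finally {
                // Explicit cleanup: nothing is left registered until JVM exit.
                for (Path p : spillFiles) {
                    Files.deleteIfExists(p);
                }
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        int remaining = 0;
        for (Path p : spillFiles) {
            if (Files.exists(p)) {
                remaining++;
            }
        }
        return remaining;
    }

    public static void main(String[] args) {
        System.out.println("files left behind: " + spillAndCleanUp(1000));
    }
}
```

The list of paths here still grows with the number of spills, but it becomes garbage once the method returns, whereas paths registered through deleteOnExit() stay reachable for the life of the JVM.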
> >>>
> >>>
> >>>
> >>> Haitao Yao
> >>> [EMAIL PROTECTED]
> >>> weibo: @haitao_yao
> >>> Skype:  haitao.yao.final
> >>>
> >>> On 2012-7-10, at 4:20 PM, Haitao Yao wrote:
> >>>
> >>>> I found the solution.
> >>>>
> >>>> After analyzing the heap dump taken when the reducer hit OOM, I found
> >>>> that the memory is consumed by org.apache.pig.data.InternalCachedBag.
> >>>> Here's the diagram:
> >>>> <cc.jpg>
> >>>>
> >>>> In the source code of org.apache.pig.data.InternalCachedBag, I found
> >>>> out there's a parameter for the cache limit:
> >>>> public InternalCachedBag(int bagCount) {
> >>>>     float percent = 0.2F;
> >>>>
> >>>>     if (PigMapReduce.sJobConfInternal.get() != null) {
> >>>>         // here, the cache limit is from here!
> >>>>         String usage =
> >>>>             PigMapReduce.sJobConfInternal.get().get("pig.cachedbag.memusage");
> >>>>         if (usage != null) {
> >>>>             percent = Float.parseFloat(usage);
> >>>>         }
> >>>>     }
> >>>>
> >>>>     init(bagCount, percent);
> >>>> }
> >>>>
> >>>> private void init(int bagCount, float percent) {
> >>>>     factory = TupleFactory.getInstance();
> >>>>     mContents = new ArrayList<Tuple>();
> >>>>
> >>>>     long max = Runtime.getRuntime().maxMemory();
> >>>>     maxMemUsage = (long)(((float)max * percent) / (float)bagCount);
> >>>>     cacheLimit = Integer.MAX_VALUE;
> >>>>
> >>>>     // set limit to 0, if memusage is 0 or really really small.
> >>>>     // then all tuples are put into disk
> >>>>     if (maxMemUsage < 1) {
> >>>>         cacheLimit = 0;
> >>>>     }
> >>>>     log.warn("cacheLimit: " + this.cacheLimit);
> >>>>     addDone = false;
> >>>> }
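
To make the arithmetic in the quoted init() easier to follow, here is a minimal standalone sketch of just the budget calculation. It copies the two expressions from the quoted code (the per-bag memory budget and the cacheLimit decision) and nothing else; the class name CacheLimitSketch, the method names, and the 1 GiB heap size are hypothetical and chosen only for illustration.

```java
public class CacheLimitSketch {

    /** Per-bag memory budget in bytes, using the same expression as the quoted init(). */
    public static long maxMemUsage(long maxHeapBytes, float percent, int bagCount) {
        return (long) (((float) maxHeapBytes * percent) / (float) bagCount);
    }

    /** As in the quoted code: 0 when the budget is below 1 byte, otherwise Integer.MAX_VALUE. */
    public static int cacheLimit(long maxMemUsage) {
        return maxMemUsage < 1 ? 0 : Integer.MAX_VALUE;
    }

    public static void main(String[] args) {
        long heap = 1024L * 1024L * 1024L; // hypothetical 1 GiB max heap

        // Default setting (percent = 0.2) shared by 6 bags.
        long defaultBudget = maxMemUsage(heap, 0.2F, 6);
        System.out.println("budget per bag: " + defaultBudget
                + ", cacheLimit: " + cacheLimit(defaultBudget));

        // pig.cachedbag.memusage=0: the budget is 0, so cacheLimit becomes 0
        // and every tuple goes to disk.
        long zeroBudget = maxMemUsage(heap, 0.0F, 6);
        System.out.println("budget per bag: " + zeroBudget
                + ", cacheLimit: " + cacheLimit(zeroBudget));
    }
}
```

With the default 0.2, the budget is 20% of the maximum heap divided evenly among the bags, so more bags per task means a smaller in-memory share for each one.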
> >>>>
> >>>> So, after writing pig.cachedbag.memusage=0 into