Re: What is the best way to do counting in pig?
Haitao Yao 2012-08-09, 14:51
Hey, all, I've submitted the patch for PIG-2812; here's the link: https://issues.apache.org/jira/browse/PIG-2812
I didn't change the data bags to spill into only one file, since that would be a very big modification. Instead, I let DefaultAbstractDataBag spill into a single directory and delete that directory recursively with a shutdown hook.

Haitao Yao
[EMAIL PROTECTED]
weibo: @haitao_yao
Skype: haitao.yao.final

On 2012-7-12, at 11:04 AM, Haitao Yao wrote:

> Sorry, here's the full mail.
>
> > Is your query using combiner ?
> I didn't know how to explicitly use the combiner.
>
> > Can you send the explain plan output ?
> The explain result is in the attachment. It's a little long. Link: http://pastebin.com/Q6CvKiP1
>
> <aa.explain>
>
> > Does the heap information say how many entries are there in the
> > InternalCachedBag's ArrayList ?
> There are 6 big ArrayLists, and the size is about 372692.
> Here are the screenshots of the heap dump:
>
> Screenshot 1: you can see there are 6 big POForEach instances.
>
> <aa.jpg>
>
> Screenshot 2: you can see the memory is mostly retained by the big ArrayList.
>
> <bb.jpg>
>
> Screenshot 3: you can see the big ArrayList is referenced by InternalCachedBag:
>
> <cc.jpg>
>
> > What version of pig are you using?
> pig-0.9.2. I've read the latest source code of Pig from GitHub, and I didn't find any improvements to InternalCachedBag.
>
> On 2012-7-12, at 10:58 AM, Jonathan Coveney wrote:
>
>> the listserv strips attachments. you'll have to host it somewhere else and
>> link it
>>
>> 2012/7/11 Haitao Yao <[EMAIL PROTECTED]>
>>
>>> Sorry, I sent the mail only to Thejas.
>>>
>>> Resending it for all.
>>>
>>> Haitao Yao
>>> [EMAIL PROTECTED]
>>> weibo: @haitao_yao
>>> Skype: haitao.yao.final
>>>
>>> On 2012-7-12, at 10:41 AM, Haitao Yao wrote:
>>>
>>>>> Is your query using combiner ?
>>>> I didn't know how to explicitly use the combiner.
>>>>
>>>>> Can you send the explain plan output ?
>>>> The explain result is in the attachment. It's a little long.
>>>>
>>>> <aa.explain>
>>>>
>>>>> Does the heap information say how many entries are there in the
>>>>> InternalCachedBag's ArrayList ?
>>>> There are 6 big ArrayLists, and the size is about 372692.
>>>> Here are the screenshots of the heap dump:
>>>>
>>>> Screenshot 1: you can see there are 6 big POForEach instances.
>>>>
>>>> <aa.jpg>
>>>>
>>>> Screenshot 2: you can see the memory is mostly retained by the big ArrayList.
>>>>
>>>> <bb.jpg>
>>>>
>>>> Screenshot 3: you can see the big ArrayList is referenced by InternalCachedBag:
>>>>
>>>> <cc.jpg>
>>>>
>>>>> What version of pig are you using?
>>>> pig-0.9.2. I've read the latest source code of Pig from GitHub, and I didn't find any improvements to InternalCachedBag.
>>>>
>>>> Haitao Yao
>>>> [EMAIL PROTECTED]
>>>> weibo: @haitao_yao
>>>> Skype: haitao.yao.final
>>>>
>>>> On 2012-7-12, at 8:56 AM, Thejas Nair wrote:
>>>>
>>>>> Haitao,
>>>>> Is your query using combiner ? Can you send the explain plan output ?
>>>>> Does the heap information say how many entries are there in the
>>>>> InternalCachedBag's ArrayList ?
>>>>> What version of pig are you using ?
>>>>>
>>>>> Thanks,
>>>>> Thejas
>>>>>
>>>>> On 7/10/12 11:50 PM, Haitao Yao wrote:
>>>>>> Oh, new discovery: we cannot set pig.cachedbag.memusage = 0, because
>>>>>> every time the InternalCachedBag spills, it creates a new tmp file in
>>>>>> java.io.tmpdir. If we set pig.cachedbag.memusage to 0, every new tuple
>>>>>> added into InternalCachedBag will create a new tmp file. And the tmp
>>>>>> file is deleted on exit.
>>>>>> So, if you're unlucky like me, you will get an OOM exception caused by
>>>>>> java.io.DeleteOnExitHook!
>>>>>> Here's the evidence:
>>>>>>
>>>>>> God, we really need a full description of how every parameter works.
>>>>>>
>>>>>> Haitao Yao
>>>>>> [EMAIL PROTECTED] <mailto:[EMAIL PROTECTED]>
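
For anyone following the thread, here is a minimal sketch of the idea described at the top of this message: keep every spill file in one directory and remove that directory from a single shutdown hook, instead of calling File.deleteOnExit() once per spill file. Each deleteOnExit() call adds a path to a set held by java.io.DeleteOnExitHook until the JVM exits, which is why one tmp file per tuple can end in an OOM. This is not the actual PIG-2812 patch: the class name SpillDirectory, its methods, and the "pigbag-sketch" prefix are hypothetical, and the sketch assumes Java 8+ (java.nio.file and lambdas), which is newer than what pig-0.9.2 targeted.

```java
import java.io.IOException;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;

/** Hypothetical helper (not part of Pig): all spill files live in one temp directory. */
final class SpillDirectory {
    private final Path dir;

    SpillDirectory() throws IOException {
        // One directory under java.io.tmpdir for every spill file of this instance.
        dir = Files.createTempDirectory("pigbag-sketch");
        // A single shutdown hook replaces one deleteOnExit() entry per spill file.
        Runtime.getRuntime().addShutdownHook(new Thread(() -> {
            try {
                deleteRecursively(dir);
            } catch (IOException ignored) {
                // Best effort: nothing useful can be done while the JVM is exiting.
            }
        }));
    }

    /** Creates a new spill file inside the directory; deleteOnExit() is never called. */
    Path newSpillFile() throws IOException {
        return Files.createTempFile(dir, "spill", ".tmp");
    }

    Path directory() {
        return dir;
    }

    /** Deletes root and everything below it; does nothing if root is already gone. */
    static void deleteRecursively(Path root) throws IOException {
        if (!Files.exists(root)) {
            return;
        }
        Files.walkFileTree(root, new SimpleFileVisitor<Path>() {
            @Override
            public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                Files.delete(file);
                return FileVisitResult.CONTINUE;
            }

            @Override
            public FileVisitResult postVisitDirectory(Path d, IOException e) throws IOException {
                if (e != null) {
                    throw e;
                }
                Files.delete(d); // The directory is empty by the time this runs.
                return FileVisitResult.CONTINUE;
            }
        });
    }
}
```

With this shape, the memory kept for cleanup is one hook and one path no matter how many times a bag spills. The spill files themselves still pile up on disk until shutdown, so this only addresses the DeleteOnExitHook growth, not the number of files created.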