Re: Pig multiple groupby problem
When you tried 2888, did you have pig.exec.mapPartAgg set to true,
and pig.exec.mapPartAgg.minReduction set to a low value (2 or 3)?
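As a rough illustration of why a low minReduction value matters here: in-map partial aggregation only pays off when combining records inside the map task shrinks the record count by enough. The sketch below is not Pig's implementation. The function names and the `>=` comparison are assumptions made for illustration only.

```python
from collections import defaultdict


def partial_aggregate(records):
    """Combine (key, value) records in memory, summing values per key."""
    totals = defaultdict(int)
    for key, value in records:
        totals[key] += value
    return dict(totals)


def reduction_factor(records):
    """Ratio of input records to records left after in-map combining."""
    combined = partial_aggregate(records)
    if not combined:
        return 1.0  # nothing to combine, so no reduction
    return len(records) / len(combined)


def worth_aggregating(records, min_reduction):
    """Hypothetical check: keep partial aggregation on only if it reduces enough."""
    return reduction_factor(records) >= min_reduction


# Six records collapse to three keys, a reduction factor of 2.
sample = [("a", 1), ("a", 2), ("b", 3), ("b", 4), ("c", 5), ("c", 6)]
```

With data like `sample`, a threshold of 10 would leave partial aggregation off, while a threshold of 2 would keep it on.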

You said you applied the patch -- what version are you currently running?

Other approaches are also probabilistic, so if you need exact counts, no
dice. I was thinking Bloom filters or HyperLogLog.
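For readers unfamiliar with HyperLogLog, here is a minimal Python sketch of the idea, not a production implementation and not anything shipped with Pig. The class name, the choice of SHA-1 as the hash, and the default of 2^12 registers are all assumptions made for illustration. The estimate is approximate (roughly 1 to 2 percent error at this size), which is why it does not help when exact counts are required.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p  # number of registers; the alpha formula below assumes m >= 128
        self.registers = [0] * self.m

    def add(self, item):
        # Take 64 bits of a hash of the item.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        index = h >> (64 - self.p)               # top p bits pick a register
        rest = h & ((1 << (64 - self.p)) - 1)    # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[index]:
            self.registers[index] = rank

    def merge(self, other):
        """Fold another sketch into this one (register-wise maximum)."""
        if other.p != self.p:
            raise ValueError("sketches must use the same precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw
```

Because two sketches merge by taking the register-wise maximum, partial sketches can be built independently per chunk of input and combined later, which is what makes the structure attractive for skewed group-bys.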

D

On Fri, Sep 28, 2012 at 2:40 PM, Deepak Tiwari <[EMAIL PROTECTED]> wrote:

> Hi Dmitriy
>
> I did try 2888 (I checked out a fresh copy of trunk and applied the patch),
> and unfortunately it did not make much difference for me. You mentioned
> other distinct counting approaches. Could you please give me more details
> and any hints on how to implement them?
>
> Regards,
>
> Deepak.
>
> On Wed, Aug 29, 2012 at 1:05 PM, Deepak Tiwari <[EMAIL PROTECTED]> wrote:
>
> > Thanks Dmitriy.
> >
> > 1) Yup, exact distinct counts are required, since this is finance
> > reporting. (I actually had thought about a Bloom filter, but since we
> > need exact counts it might not be applicable.)
> > 2) Oh, I think PIG-2888 was filed recently; it didn't come up in my
> > search previously. Sure, I will apply the patch and see if that makes
> > any difference.
> >
> > Thanks very much for responding.
> >
> >
> >
> > On Tue, Aug 28, 2012 at 11:45 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> >
> >> Couple of ideas:
> >>
> >> 1) Do you need exact distinct counts? There are approximate distinct
> >> counting approaches that may be appropriate and much more efficient.
> >> 2) Can you try with PIG-2888?
> >>
> >> On Aug 28, 2012, at 1:35 PM, Deepak Tiwari <[EMAIL PROTECTED]> wrote:
> >>
> >> > Hi,
> >> >
> >> > I am processing a huge dataset and need to aggregate data on multiple
> >> > levels (columns).
> >> >
> >> > For example: A, B, C, D, E, F, CalculateDistinctOnValue1,
> >> > CalculateDistinctOnValue2, Sum(value3)
> >> >
> >> > I have tried two approaches. In the first, I read the file once and
> >> > generate a group-by for each level,
> >> >
> >> > for example group by (A,B), group by (A,B,C)
> >> >
> >> > This requires a distinct inside the foreach, which takes too much time,
> >> > mostly because of skew. (I have enabled multiquery.)
> >> >
> >> > In the second approach, I created 8 separate scripts, one per group-by,
> >> > but that takes more or less the same time and is not very efficient
> >> > either. Could someone please suggest another way?
> >> >
> >> > Thanks in advance.
> >> >
> >> >
> >> > Deepak
> >>
> >
> >
>
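For reference, the aggregation described in the original question (exact distinct counts on two values plus a sum, computed at several grouping levels from a single read of the data) can be sketched in plain Python as follows. This only illustrates the logic, not how Pig executes it. The column layout `(a, b, c, value1, value2, value3)`, the `LEVELS` list, and the function name are hypothetical.

```python
from collections import defaultdict

# Hypothetical grouping levels: group by (a, b) and by (a, b, c).
LEVELS = [2, 3]


def aggregate(rows, levels=LEVELS):
    """Single pass over rows, producing results for every grouping level.

    Returns {group_key: (distinct value1 count, distinct value2 count, sum of value3)}.
    """
    distinct1 = defaultdict(set)
    distinct2 = defaultdict(set)
    total3 = defaultdict(int)
    for a, b, c, value1, value2, value3 in rows:
        dims = (a, b, c)
        for n in levels:
            key = dims[:n]  # prefix of the dimension columns for this level
            distinct1[key].add(value1)
            distinct2[key].add(value2)
            total3[key] += value3
    return {
        key: (len(distinct1[key]), len(distinct2[key]), total3[key])
        for key in total3
    }


rows = [
    ("x", "y", "p", 1, 10, 5),
    ("x", "y", "p", 1, 11, 5),
    ("x", "y", "q", 2, 10, 7),
    ("x", "z", "p", 1, 10, 1),
]
result = aggregate(rows)
```

Exactness comes at a cost: every group key holds the full set of its distinct values in memory, so a heavily skewed key ends up with a very large set. That is the same pressure point the thread attributes to skew.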