-
Pig multiple groupby problem
Deepak Tiwari 2012-08-28, 20:35
Hi,

I am processing a huge dataset and need to aggregate data at multiple levels (column combinations), for example:

A, B, C, D, E, F, CalculateDistinctOnValue1, CalculateDistinctOnValue2, Sum(value3)

I have tried two approaches. In the first, I read the file once and generate a group-by for each level, for example group by (A,B), group by (A,B,C). Since I have to do the distinct inside a foreach, this takes too much time, mostly because of skew. (I have enabled multiquery.)

In the second approach, I created 8 separate scripts, one per group-by, but that takes more or less the same time and is not very efficient.

Could someone please suggest another way?

Thanks in advance.

Deepak
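For reference, the computation described above can be modeled in a few lines of plain Python. This is only a sketch of what is being computed, not Pig code and not the poster's script: the function name `aggregate_levels`, the sample rows, and the column names `value1`, `value2`, `value3` are assumptions made for illustration.

```python
from collections import defaultdict


def aggregate_levels(rows, levels):
    """Single pass over rows; for every grouping level keep, per key,
    the set of value1, the set of value2, and the running sum of value3."""
    state = {level: defaultdict(lambda: [set(), set(), 0]) for level in levels}
    for row in rows:
        for level in levels:
            key = tuple(row[col] for col in level)
            acc = state[level][key]
            acc[0].add(row["value1"])   # exact distinct needs the full set
            acc[1].add(row["value2"])
            acc[2] += row["value3"]
    return {
        level: {
            key: (len(v1), len(v2), total)
            for key, (v1, v2, total) in groups.items()
        }
        for level, groups in state.items()
    }


# Hypothetical sample data with three of the grouping columns.
rows = [
    {"A": "x", "B": "p", "C": "m", "value1": 1, "value2": "u", "value3": 10},
    {"A": "x", "B": "p", "C": "n", "value1": 1, "value2": "v", "value3": 5},
    {"A": "x", "B": "q", "C": "m", "value1": 2, "value2": "u", "value3": 7},
]
result = aggregate_levels(rows, [("A", "B"), ("A", "B", "C")])
print(result[("A", "B")][("x", "p")])  # (1, 2, 15)
```

The per-key sets are the expensive part: an exact distinct count has to remember every value seen for a key, so a heavily skewed key carries a correspondingly large set.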
-
Re: Pig multiple groupby problem
Dmitriy Ryaboy 2012-08-29, 06:45
Couple of ideas:

1) Do you need exact distinct counts? There are approximate distinct counting approaches that may be appropriate and much more efficient.
2) Can you try with PIG-2888?
-
Re: Pig multiple groupby problem
Deepak Tiwari 2012-08-29, 20:05
Thanks Dmitriy.

1) Yup, exact distinct counts are required, since it is finance reporting. (I had actually thought about a Bloom filter, but since we need exact counts it might not be applicable.)
2) Oh, I think PIG-2888 was filed recently; it didn't come up in my search previously. Sure, I will apply the patch and see if that makes any difference.

Thanks very much for responding.
-
Re: Pig multiple groupby problem
Deepak Tiwari 2012-09-28, 21:40
Hi Dmitriy,

I did try 2888 (I checked out a fresh copy of trunk and applied the patch) and unfortunately it did not make much difference for me. You mentioned other distinct counting approaches. Could you please give me more details and any hints on implementing them?

Regards,

Deepak
-
Re: Pig multiple groupby problem
Dmitriy Ryaboy 2012-09-28, 22:12
When you tried 2888, did you have pig.exec.mapPartAgg set to true, and pig.exec.mapPartAgg.minReduction set to a low value (2 or 3)?

You said you applied the patch. What version are you currently running?

The other approaches are also probabilistic, so if you need exact counts, no dice. I was thinking of Bloom filters or HyperLogLog.

D
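For reference, here is a minimal sketch of the HyperLogLog idea mentioned above, in plain Python. It is a textbook-style illustration under stated assumptions, not a Pig UDF and not code from this thread: the class name, the choice of SHA-1 as the hash, and the precision `p=12` are all assumptions, and the estimate is approximate by design, so it does not satisfy an exact-count requirement.

```python
import hashlib
import math


class HyperLogLog:
    """Approximate distinct counter using 2**p small registers."""

    def __init__(self, p=12):
        if p < 7:
            raise ValueError("this sketch's alpha constant assumes p >= 7")
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        x = int.from_bytes(digest[:8], "big")        # 64-bit hash value
        width = 64 - self.p
        idx = x >> width                             # first p bits pick a register
        rest = x & ((1 << width) - 1)                # remaining bits
        rank = width - rest.bit_length() + 1         # position of leftmost 1-bit
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        """Union of two sketches: element-wise max of the registers."""
        if other.p != self.p:
            raise ValueError("precision mismatch")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        m = self.m
        alpha = 0.7213 / (1 + 1.079 / m)
        raw = alpha * m * m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * m and zeros:
            return m * math.log(m / zeros)           # small-range correction
        return raw


# Two partial sketches over overlapping hypothetical ids, then merged.
left, right = HyperLogLog(), HyperLogLog()
for i in range(60000):
    left.add(f"user-{i}")
for i in range(40000, 100000):
    right.add(f"user-{i}")
left.merge(right)
estimate = left.count()  # approximates the 100000 distinct ids in the union
```

Because two sketches merge by taking the element-wise maximum, partial sketches can be built independently and combined later, which is what makes this kind of counter cheap compared with shipping full sets of values. With `p=12` the standard error is about 1.04 / sqrt(4096), roughly 1.6%.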
-
Re: Pig multiple groupby problem
Deepak Tiwari 2012-09-28, 22:27
Yeah, I believe pig.exec.mapPartAgg was true, but I think minReduction was 10 or something. I will double-check this and try again.

So if accuracy can be compromised and a Bloom filter is chosen, should I follow the approach described at http://pig.apache.org/docs/r0.10.0/api/org/apache/pig/builtin/Bloom.html ... Sorry, I am a bit hazy here.
-
Re: Pig multiple groupby problem
Dmitriy Ryaboy 2012-09-28, 22:58
Can you check if your mapper logs said anything about in-map aggregation being turned off?

In fact, the whole log of one of the mappers might help (POPartialAgg prints some helpful stats).
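To illustrate why a lower minReduction value can matter, here is a simplified Python model of in-map partial aggregation that switches itself off when it is not reducing the record count enough. It is a sketch under assumptions, not Pig's POPartialAgg logic: it assumes minReduction acts as a minimum ratio of input records to aggregated records, and the function name, the `sample_size` parameter, and the check-once-after-a-sample behavior are all hypothetical.

```python
def in_map_partial_sum(records, min_reduction=10, sample_size=1000):
    """Sum values per key inside the 'mapper'. After sample_size records,
    keep aggregating only if records_seen / distinct_keys >= min_reduction;
    otherwise flush what was aggregated and pass the rest through untouched."""
    partial = {}
    output = []
    enabled = True
    for seen, (key, value) in enumerate(records, start=1):
        if not enabled:
            output.append((key, value))           # aggregation is off: pass through
            continue
        partial[key] = partial.get(key, 0) + value
        if seen == sample_size:
            reduction = seen / len(partial)
            if reduction < min_reduction:
                enabled = False
                output.extend(partial.items())    # flush partial results
                partial = {}
    output.extend(partial.items())
    return output, enabled


# 5000 records spread over 400 keys: the sample shows a 2.5x reduction.
records = [(i % 400, 1) for i in range(5000)]
strict_out, strict_on = in_map_partial_sum(records, min_reduction=10)
loose_out, loose_on = in_map_partial_sum(records, min_reduction=2)
print(strict_on, len(strict_out))  # False 4400
print(loose_on, len(loose_out))    # True 400
```

In this model, a threshold of 10 (the value reported earlier in the thread) disables aggregation for data that reduces only 2.5x, while a threshold of 2 keeps it on and emits 400 records instead of 4400. Whether the real job behaves this way is exactly what the mapper logs would show.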
-
Re: Pig multiple groupby problem
Deepak Tiwari 2012-09-28, 23:15
Sure. I will deploy it today and run it again. I usually check the job conf file for verification, but I will send you the log files.

Thanks very much for the help.

Regards,

Deepak