Deepak Tiwari 2012-08-28, 20:35
Dmitriy Ryaboy 2012-08-29, 06:45
1) Yup, exact distinct counts are required, since this is finance reporting.
(I had actually thought about a Bloom filter, but since we need exact counts
it might not be applicable.)
2) Oh, I think PIG-2888 was filed recently; it didn't come up in my search
previously. Sure, I will apply the patch and see if that makes any difference.
Thanks very much for responding....
On Tue, Aug 28, 2012 at 11:45 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]> wrote:
> Couple of ideas:
> 1) do you need exact distinct counts? There are approximate distinct
> counting approaches that may be appropriate and much more efficient.
> 2) can you try with pig-2888?
> On Aug 28, 2012, at 1:35 PM, Deepak Tiwari <[EMAIL PROTECTED]> wrote:
> > Hi,
> > I am processing a huge dataset and need to aggregate data on multiple
> > levels (columns),
> > for example A,B,C,D,E,F, CalculateDistinctOnValue1,
> > CalculateDistinctOnValue2, Sum(value3)
> > I have tried two approaches. In the first, I read the file once and
> > generate a group-by for each level,
> > for example group by (A,B), group by (A,B,C).
> > I have to do the distinct inside a foreach, which is taking too much time,
> > mostly because of skew. (I have enabled multiquery.)
> > In the other approach, I tried creating 8 separate scripts, one to process
> > each group-by, but that takes more or less the same time and is not
> > very efficient. Could someone please suggest another way?
> > Thanks in advance.
> > Deepak
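
To make the aggregation in the original question concrete, here is a minimal sketch in plain Python (not Pig Latin, and not the poster's actual script). It reads the rows once and, for each grouping level, keeps exact distinct counts of two value columns plus a running sum of a third. The column layout, the sample rows, and the names ROWS, LEVELS, and aggregate are all hypothetical. Exactness comes from holding every distinct value in a set per group, which is also why a heavily skewed key becomes expensive.

```python
from collections import defaultdict

# Hypothetical sample rows: (A, B, C, value1, value2, value3)
ROWS = [
    ("a1", "b1", "c1", "u1", "x1", 10),
    ("a1", "b1", "c1", "u1", "x2", 5),
    ("a1", "b1", "c2", "u2", "x1", 7),
    ("a1", "b2", "c1", "u1", "x3", 1),
]

# Grouping levels as tuples of column indexes: (A, B) and (A, B, C).
LEVELS = [(0, 1), (0, 1, 2)]


def aggregate(rows, levels):
    """Single pass over rows; returns per-level, per-group
    (distinct value1 count, distinct value2 count, sum of value3)."""
    # One accumulator per level: group key -> [set(value1), set(value2), sum]
    acc = {level: defaultdict(lambda: [set(), set(), 0]) for level in levels}
    for row in rows:
        for level in levels:
            key = tuple(row[i] for i in level)
            slot = acc[level][key]
            slot[0].add(row[3])   # exact distinct tracking for value1
            slot[1].add(row[4])   # exact distinct tracking for value2
            slot[2] += row[5]     # running sum of value3
    return {
        level: {key: (len(s[0]), len(s[1]), s[2]) for key, s in groups.items()}
        for level, groups in acc.items()
    }


result = aggregate(ROWS, LEVELS)
```

With the sample rows, result[(0, 1)][("a1", "b1")] is (2, 2, 22). The sets grow with the number of distinct values per group, so one very large group dominates memory and time; that is the skew problem described in the question.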
Deepak Tiwari 2012-09-28, 21:40
Dmitriy Ryaboy 2012-09-28, 22:12
Deepak Tiwari 2012-09-28, 22:27
Dmitriy Ryaboy 2012-09-28, 22:58
Deepak Tiwari 2012-09-28, 23:15