I did try PIG-2888 (I checked out a fresh copy from trunk and applied the
patch), but unfortunately it did not make much difference for me. You
mentioned other distinct-counting approaches. Could you please give me more
details and any hints on implementing them?
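For context on the approximate distinct-counting approaches mentioned later in this thread: the usual technique is a cardinality sketch such as HyperLogLog. Below is a minimal self-contained Python sketch of the idea, not a Pig UDF; the hash function (MD5 truncated to 64 bits), register count, and bias constant are simplified assumptions for illustration, and a production job would use a maintained library instead.

```python
import hashlib
import math

class HyperLogLog:
    """Minimal HyperLogLog sketch: estimates the number of distinct items."""

    def __init__(self, p=14):
        self.p = p                 # 2**p registers; more registers = lower error
        self.m = 1 << p
        self.registers = [0] * self.m

    def add(self, item):
        # 64-bit hash of the item (MD5 truncated; an assumption, not required by HLL)
        x = int(hashlib.md5(str(item).encode()).hexdigest(), 16) & ((1 << 64) - 1)
        idx = x >> (64 - self.p)                   # top p bits pick a register
        rest = x & ((1 << (64 - self.p)) - 1)      # remaining bits
        # rank = position of the leftmost 1-bit in the remaining bits
        rank = (64 - self.p) - rest.bit_length() + 1
        self.registers[idx] = max(self.registers[idx], rank)

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)      # standard bias correction
        z = 1.0 / sum(2.0 ** -r for r in self.registers)
        estimate = alpha * self.m * self.m * z
        # small-range correction: fall back to linear counting
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            estimate = self.m * math.log(self.m / zeros)
        return int(estimate)
```

With p=14 the typical relative error is around 1%, at a fixed cost of 2**14 small registers regardless of how many items are added, which is why it scales where an exact DISTINCT does not. As noted below, though, this trades exactness for space, so it would not satisfy a hard requirement for exact counts.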
On Wed, Aug 29, 2012 at 1:05 PM, Deepak Tiwari <[EMAIL PROTECTED]> wrote:
> Thanks Dmitry.
> 1) Yes, exact distinct counts are required, since this is financial
> reporting. (I had actually considered a Bloom filter, but since we need
> exact counts it may not be applicable.)
> 2) I see PIG-2888 was filed recently; it didn't come up in my search
> before. Sure, I will apply the patch and see if that makes any difference.
> Thanks very much for responding.
> On Tue, Aug 28, 2012 at 11:45 PM, Dmitriy Ryaboy <[EMAIL PROTECTED]>wrote:
>> A couple of ideas:
>> 1) Do you need exact distinct counts? There are approximate distinct-
>> counting approaches that may be appropriate and much more efficient.
>> 2) Can you try with PIG-2888?
>> On Aug 28, 2012, at 1:35 PM, Deepak Tiwari <[EMAIL PROTECTED]> wrote:
>> > Hi,
>> > I am processing a huge dataset and need to aggregate data at several
>> > levels (columns).
>> > For example: A, B, C, D, E, F, CalculateDistinctOnValue1,
>> > CalculateDistinctOnValue2, Sum(value3)
>> > I have tried two approaches. In the first, I read the file once and
>> > generate a group-by at each level,
>> > for example group by (A,B), group by (A,B,C).
>> > Since I have to do the DISTINCT inside a FOREACH, it takes too much
>> > time, mostly because of skew. (I have multi-query enabled.)
>> > In the other approach I created 8 separate scripts, one per group-by,
>> > but that takes more or less the same time and is not very efficient
>> > either. Could someone please suggest another way?
>> > Thanks in advance.
>> > Deepak
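
The single-pass approach described above (one read of the data, accumulating exact distinct counts and a sum for every group-by prefix) can be sketched in plain Python to make the logic concrete. The column names and sample rows below are hypothetical, and this is an in-memory illustration of the semantics rather than a Pig script; it also shows why skew hurts: a hot key's distinct-value set grows on a single reducer.

```python
from collections import defaultdict

# Hypothetical rows: (A, B, C, value1, value2, value3)
rows = [
    ("us", "web", "x", "u1", "s1", 10),
    ("us", "web", "y", "u2", "s1", 5),
    ("us", "app", "x", "u1", "s2", 7),
    ("eu", "web", "x", "u3", "s3", 3),
]

# One pass over the data, accumulating per group-by prefix:
# (A,), (A,B), and (A,B,C).
distinct1 = defaultdict(set)   # exact distinct value1 per group key
distinct2 = defaultdict(set)   # exact distinct value2 per group key
sums = defaultdict(int)        # sum of value3 per group key

for a, b, c, v1, v2, v3 in rows:
    for key in [(a,), (a, b), (a, b, c)]:
        distinct1[key].add(v1)
        distinct2[key].add(v2)
        sums[key] += v3

for key in sorted(sums):
    print(key, len(distinct1[key]), len(distinct2[key]), sums[key])
```

In MapReduce terms, each `key` is a reduce key and each set is the state one reducer must hold; a skewed key (say, one country with most of the traffic) concentrates that work on one reducer, which is the slowdown described in the thread.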