Pig >> mail # dev >> Is there a way around the nested distinct problem?


Re: Is there a way around the nested distinct problem?
If you are willing to give up a (very small) amount of precision, then for
this specific kind of query you can use approximate counters such as
Flajolet-Martin or HyperLogLog.
We could implement them in a special COUNT_APPROX() builtin function.
You can also use Bloom filters to implement an approximate distinct.
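
As a rough illustration of the approximate-counter idea, here is a minimal
HyperLogLog sketch in Python. It is not Pig code and not the proposed
COUNT_APPROX() builtin: the names HyperLogLog and approx_counts, the
precision parameter p, and the use of SHA-1 as the hash are all assumptions
made for the example. Each sketch uses a fixed number of registers no matter
how many distinct values it sees, and two sketches can be merged by taking
the register-wise maximum.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter with 2**p registers (hypothetical API)."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p              # number of registers
        self.registers = [0] * self.m

    def add(self, item):
        # Derive a 64-bit hash from SHA-1 (an arbitrary choice for this sketch).
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                     # top p bits select the register
        rest = h & ((1 << width) - 1)        # remaining bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def merge(self, other):
        # Register-wise max gives the sketch of the union of both inputs.
        if other.p != self.p:
            raise ValueError("cannot merge sketches with different precision")
        self.registers = [max(a, b) for a, b in zip(self.registers, other.registers)]

    def count(self):
        alpha = 0.7213 / (1 + 1.079 / self.m)   # bias constant, valid for m >= 128
        raw = alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


def approx_counts(rows):
    """For (x, y) rows, return {x: (approx distinct y, total rows)}."""
    sketches, totals = {}, {}
    for x, y in rows:
        if x not in sketches:
            sketches[x] = HyperLogLog()
        sketches[x].add(y)
        totals[x] = totals.get(x, 0) + 1
    return {x: (round(sketches[x].count()), totals[x]) for x in sketches}
```

With p=12 the sketch keeps 4096 small registers per group and the typical
relative error is on the order of 1.04/sqrt(4096), i.e. under 2%. The merge
step is what a combiner-friendly implementation would presumably rely on,
since partial sketches from different map tasks can be combined without ever
holding the distinct values themselves.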

For the general case, I think there is no solution.
Inherently a nested operation is handled locally, so memory restrictions
apply.
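
Since the nested operation runs locally, whatever replaces the exact
distinct has to live in bounded memory. The Bloom filter option can be
sketched as follows, again in Python and only as an illustration: the class
name BloomDistinctCounter, the default sizes, and the SHA-1 based double
hashing are assumptions, not anything that exists in Pig. An element is
counted only if the filter has not (apparently) seen it before, so false
positives can make the result slightly too low but never too high.

```python
import hashlib


class BloomDistinctCounter:
    """Approximate distinct counter backed by a fixed-size Bloom filter."""

    def __init__(self, num_bits=1 << 20, num_hashes=5):
        # num_bits is assumed to be a multiple of 8 in this sketch.
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8)
        self.count = 0

    def _positions(self, item):
        # Double hashing: derive k bit positions from two 64-bit hash values.
        digest = hashlib.sha1(str(item).encode("utf-8")).digest()
        h1 = int.from_bytes(digest[:8], "big")
        h2 = int.from_bytes(digest[8:16], "big") | 1
        return [(h1 + i * h2) % self.num_bits for i in range(self.num_hashes)]

    def add(self, item):
        """Record item; return True if it was counted as new."""
        positions = self._positions(item)
        seen = all(self.bits[pos >> 3] & (1 << (pos & 7)) for pos in positions)
        if not seen:
            for pos in positions:
                self.bits[pos >> 3] |= 1 << (pos & 7)
            self.count += 1
        return not seen
```

The memory use here is fixed at num_bits / 8 bytes per group (128 KiB with
the defaults above), but the undercount grows as the filter fills up, so the
size has to be chosen with the largest expected group in mind.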

Cheers,
--
Gianmarco

On Thu, Nov 24, 2011 at 04:44, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> The problem being that as datasets grow, a nested distinct can often lead
> to a heap error (I've been testing some scripts in Pig 0.9, and for
> whatever reason a bunch of scripts that are on the edge in Pig 0.8 are
> dying in Pig 0.9 with heap errors caused by distinct...but there are a lot
> of moving parts there). Either way, I'm of the opinion that heap errors
> are bad!
>
> I was wondering if there are any known methods (or papers in academia) of
> efficient ways around this, short of grouping the data two separate times?
>
> so for example, we have
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) {
>  dst=distinct a.y;
>  generate group, COUNT(dst), COUNT(a);
> }
>
> so this gives us the distinct number of y per x and the total number of y
> per x (you can imagine that x is a website, y is a cookie, and the values
> are distinct cookies and page views).
>
> So in this case, eventually certain sites may get really popular and there
> could be enough distinct cookies to kill you.
>
> Now, the way that I'd normally refactor the code is...
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) generate group, COUNT(a);
>
> c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
> d = foreach (group c by x) generate group, COUNT(c);
>
> and then you join them on the group key to merge the two. Nasty!
>
> Are there any ways to optimize these sorts of queries on the Pig side to
> avoid memory issues, and to keep the syntax clean?
>
> I appreciate the thought
> Jon
>