


Is there a way around the nested distinct problem?
Jonathan Coveney 20111124, 03:44
The problem is that as datasets grow, a nested distinct can often lead to a heap error. (I've been testing some scripts in Pig 0.9, and for whatever reason a bunch of scripts that are on the edge in Pig 0.8 are dying in Pig 0.9 with heap errors caused by distinct, but there are a lot of moving parts there.) Either way, I'm of the opinion that heap errors are bad!
I was wondering if there are any known methods (or academic papers) for working around this efficiently, short of grouping the data two separate times?
So, for example, we have:
a = load 'thing' as (x,y);
b = foreach (group a by x) {
    dst = distinct a.y;
    generate group, COUNT(dst), COUNT(a);
}
So this gives us the distinct number of y per x and the total number of y per x (you can imagine that x is a website and y is a cookie, so the values are distinct cookies and page views).
So in this case, eventually certain sites may get really popular and there could be enough distinct cookies to kill you.
Now, the way that I'd normally refactor the code is...
a = load 'thing' as (x,y);
b = foreach (group a by x) generate group, COUNT(a);

c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
d = foreach (group c by x) generate group, COUNT(c);
and then you join them on the group key to merge the two. Nasty!
Are there any ways to optimize these sorts of queries on the Pig side to avoid memory issues, and to keep the syntax clean?
I appreciate the thought.
Jon
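As an illustration of the two-pass refactor described above, here is a minimal sketch of the same dataflow in Python rather than Pig. The sample rows and the names `rows`, `views`, `uniques`, and `merged` are made up for this example. It only shows the logic (count rows per site, count unique pairs per site, then join on the site key); it says nothing about memory behaviour, since in Pig the point is that the group by (x, y) is spread across reducers instead of being held in one bag.

```python
from collections import Counter

# Hypothetical (site, cookie) rows standing in for the relation `a`.
rows = [
    ("siteA", "c1"), ("siteA", "c1"), ("siteA", "c2"),
    ("siteB", "c3"),
]

# Pass 1: total rows per site, like COUNT(a) after grouping by x.
views = Counter(site for site, _ in rows)

# Pass 2: collapse to unique (site, cookie) pairs, then count pairs per site.
# This mirrors the group by (x, y) followed by a second group by x.
unique_pairs = set(rows)
uniques = Counter(site for site, _ in unique_pairs)

# The join on the group key that merges the two results:
# site -> (distinct cookies, page views).
merged = {site: (uniques[site], views[site]) for site in views}
```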

Re: Is there a way around the nested distinct problem?
Gianmarco De Francisci Mo... 20111124, 11:27
If you are willing to give up some (very small) precision, for this specific kind of query you can use approximate counters such as Flajolet-Martin or HyperLogLog counters. We could implement them in a special COUNT_APPROX() builtin function. You can also use Bloom filters to build an approximate distinct implementation.
For the general case, I think there is no solution. Inherently a nested operation is handled locally, so memory restrictions apply.
Cheers,
Gianmarco
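To make the approximate-counter suggestion above concrete, here is a rough HyperLogLog sketch in Python. It is not the proposed COUNT_APPROX() builtin and not any existing Pig function; the class name, the choice of SHA-1 as the hash, and the precision `p=12` are assumptions made for this illustration. The idea is that each group keeps a fixed array of small registers instead of the full set of distinct values.

```python
import hashlib
import math


class HyperLogLog:
    """Illustrative HyperLogLog counter with 2**p registers."""

    def __init__(self, p=12):
        self.p = p
        self.m = 1 << p
        self.registers = [0] * self.m
        # Bias-correction constant; this formula is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, item):
        # Take a 64-bit hash of the item (first 8 bytes of SHA-1).
        digest = hashlib.sha1(item.encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        width = 64 - self.p
        idx = h >> width                  # top p bits choose the register
        rest = h & ((1 << width) - 1)     # remaining bits
        # Rank = position of the leftmost 1-bit in `rest` (1-based).
        rank = width - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        raw = self.alpha * self.m * self.m / sum(2.0 ** -r for r in self.registers)
        zeros = self.registers.count(0)
        if raw <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            return self.m * math.log(self.m / zeros)
        return raw


# Hypothetical usage: 50,000 distinct cookies, each seen twice.
hll = HyperLogLog()
for i in range(50000):
    hll.add(f"cookie-{i}")
    hll.add(f"cookie-{i}")  # duplicates never raise a register further
estimate = hll.count()
```

With `p=12` the counter holds 4096 registers no matter how many cookies arrive, and the expected relative error is roughly 1.04 / sqrt(4096), or about 1.6%. That fixed footprint is what would let such a counter sidestep the heap problem, at the cost of an approximate answer.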

Re: Is there a way around the nested distinct problem?
John Meagher 20111124, 15:16
I've done this with the following:
raw = load 'thing' as (user,page);
pageviews = foreach (group raw by (user, page)) generate flatten(group), COUNT($1) as pageviews;
pagecounts = foreach (group pageviews by page) generate flatten(group), COUNT($1) as uniques, SUM(pageviews.pageviews) as pageviews;
It's the only way I've been able to get it to scale.
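The two-stage grouping above can be sketched in Python to show why no join is needed. The sample rows and variable names here are invented for the example, and this is only a model of the dataflow, not of how Pig executes it. Stage one counts views per (user, page); stage two counts the rows per page (the uniques) and sums the per-pair counts (the total views).

```python
from collections import Counter, defaultdict

# Hypothetical (user, page) rows standing in for the relation `raw`.
raw = [
    ("u1", "/home"), ("u1", "/home"), ("u2", "/home"),
    ("u1", "/about"),
]

# Stage 1: views per (user, page), like the group by (user, page).
pageviews = Counter(raw)

# Stage 2: per page, count the (user, page) rows and sum their view counts.
uniques = defaultdict(int)
totals = defaultdict(int)
for (user, page), n in pageviews.items():
    uniques[page] += 1   # one row per distinct user on this page
    totals[page] += n    # add that user's views to the page total

# page -> (unique users, page views)
pagecounts = {page: (uniques[page], totals[page]) for page in uniques}
```

Because the second stage only needs a running count and a running sum per page, nothing has to hold the full set of distinct users for a page in memory at once.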

