Pig >> mail # dev >> Is there a way around the nested distinct problem?


Jonathan Coveney 2011-11-24, 03:44
Re: Is there a way around the nested distinct problem?
If you are willing to give up a (very small) amount of precision, then
for this specific kind of query you can use approximate counters such as
Flajolet-Martin or HyperLogLog.
We could implement them in a special COUNT_APPROX() builtin function.
You can also use Bloom filters to build an approximate distinct
implementation.
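
To make the approximate-counter idea concrete, here is a minimal HyperLogLog sketch in Python. It is an illustration only, not an existing Pig builtin: the class name `HyperLogLog`, the helper `approx_counts`, the default of 12 index bits, and the use of SHA-1 as the hash are all assumptions made for this example. The point is that each group key keeps a fixed-size array of registers instead of the full set of distinct values.

```python
import hashlib
import math


class HyperLogLog:
    """Minimal HyperLogLog sketch (hypothetical helper, not a Pig API)."""

    def __init__(self, p=12):
        self.p = p                    # number of bits used to pick a register
        self.m = 1 << p               # number of registers (4096 for p=12)
        self.registers = [0] * self.m
        # Bias-correction constant; this closed form is valid for m >= 128.
        self.alpha = 0.7213 / (1 + 1.079 / self.m)

    def add(self, value):
        # Assumption: the first 8 bytes of SHA-1 serve as a 64-bit hash.
        digest = hashlib.sha1(str(value).encode("utf-8")).digest()
        h = int.from_bytes(digest[:8], "big")
        idx = h >> (64 - self.p)                  # top p bits select a register
        rest = h & ((1 << (64 - self.p)) - 1)     # remaining 64 - p bits
        # Rank = 1-based position of the leftmost 1-bit in the remaining bits.
        rank = (64 - self.p) - rest.bit_length() + 1
        if rank > self.registers[idx]:
            self.registers[idx] = rank

    def count(self):
        z = sum(2.0 ** -r for r in self.registers)
        estimate = self.alpha * self.m * self.m / z
        zeros = self.registers.count(0)
        if estimate <= 2.5 * self.m and zeros:
            # Small-range correction: fall back to linear counting.
            estimate = self.m * math.log(self.m / zeros)
        return int(round(estimate))


def approx_counts(rows, p=12):
    """For (x, y) rows, return {x: (approx_distinct_y, total_rows)}."""
    sketches, totals = {}, {}
    for x, y in rows:
        if x not in sketches:
            sketches[x] = HyperLogLog(p)
            totals[x] = 0
        sketches[x].add(y)
        totals[x] += 1
    return {x: (sketches[x].count(), totals[x]) for x in sketches}
```

With 12 index bits the standard error is roughly 1.04 / sqrt(4096), or about 1.6%, and memory per key stays at 4096 small registers no matter how many distinct cookies a site accumulates.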

For the general case, I think there is no solution.
Inherently a nested operation is handled locally, so memory restrictions
apply.

Cheers,
--
Gianmarco

On Thu, Nov 24, 2011 at 04:44, Jonathan Coveney <[EMAIL PROTECTED]> wrote:

> The problem is that as datasets grow, a nested distinct can often lead
> to a heap error. (I've been testing some scripts in Pig9, and for whatever
> reason a bunch of scripts that are on the edge in pig8 are dying in pig9
> with heap errors caused by distinct, but there are a lot of moving parts
> there.) Either way, I'm of the opinion that heap errors are bad!
>
> I was wondering if there are any known methods (or papers in academia) of
> efficient ways around this, short of grouping the data two separate times?
>
> so for example, we have
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) {
>  dst=distinct a.y;
>  generate group, COUNT(dst), COUNT(a);
> }
>
> so this gives us the distinct number of y per x and the total number of y
> per x (you can imagine that x is a website, y is a cookie, and the values
> are distinct cookies and page views).
>
> So in this case, eventually certain sites may get really popular and there
> could be enough distinct cookies to kill you.
>
> Now, the way that I'd normally refactor the code is...
>
> a = load 'thing' as (x,y);
> b = foreach (group a by x) generate group, COUNT(a);
>
> c = foreach (group a by (x,y)) generate flatten(group) as (x,y);
> d = foreach (group c by x) generate group, COUNT(c);
>
> and then you join them on the group key to merge the two. Nasty!
>
> Are there any ways to optimize these sorts of queries on the Pig side to
> avoid memory issues, and to keep the syntax clean?
>
> I appreciate the thought
> Jon
>
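
The two-pass refactoring quoted above stops short of the final join, so here is a small Python sketch of the whole dataflow. It only mirrors the logic of relations b, c and d and the join on x. The function name `counts_per_key` is hypothetical, and unlike Pig, which would run each grouping as a distributed job, this version holds everything in memory.

```python
from collections import Counter


def counts_per_key(rows):
    """For (x, y) rows, return {x: (distinct_y, total_rows)}."""
    rows = list(rows)
    # b: total number of rows per x.
    totals = Counter(x for x, _ in rows)
    # c: one row per unique (x, y) pair, like grouping by (x, y).
    pairs = set(rows)
    # d: number of unique pairs per x, i.e. distinct y per x.
    distinct = Counter(x for x, _ in pairs)
    # Join b and d on x to merge the two counts.
    return {x: (distinct[x], totals[x]) for x in totals}
```

The design trade-off is the one described in the quoted message: no single group ever has to hold all of its distinct values in one bag, at the cost of an extra grouping step and a join.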
John Meagher 2011-11-24, 15:16