Pig, mail # user - SAMPLE after a GROUP BY


Jacob Perkins 2011-04-24, 17:41
Alan Gates 2011-04-25, 16:02
Re: SAMPLE after a GROUP BY
Jacob Perkins 2011-04-26, 14:43
JIRA filed, see:

https://issues.apache.org/jira/browse/PIG-2014

--jacob
@thedatachef

On Mon, 2011-04-25 at 09:02 -0700, Alan Gates wrote:
> You are not insane.  Pig rewrites SAMPLE into a filter, and then pushes
> that filter in front of the GROUP.  It shouldn't push that filter,
> since the UDF is non-deterministic.  If you add "-t PushUpFilter" to
> your command line when invoking pig, this won't happen.  Could you file
> a JIRA for this so we keep track of it?
>
> Alan.
>
> On Apr 24, 2011, at 10:41 AM, Jacob Perkins wrote:
>
> > So I'm running into something strange. Consider the following code:
> >
> > tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
> > grouped = GROUP tfidf_all BY doc_id;
> > vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
> > DUMP vectors;
> >
> > This, of course, runs just fine. tfidf_all contains 1,428,280 records.
> > The reduce output records should be exactly the number of documents,
> > which turns out to be 18,863 in this case. All well and good.
> >
> > The strangeness comes when I add a SAMPLE command:
> >
> > sampled = SAMPLE vectors 0.0012;
> > DUMP sampled;
> >
> > Running this results in 1,513 reduce output records. So, am I insane,
> > or shouldn't the reduce output records be much, much closer to 22 or 23
> > records (e.g. 0.0012*18863)?
> >
> > --jacob
> > @thedatachef
> >
>
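
For anyone landing on this thread later, here is a rough back-of-the-envelope check of why the pushed-up filter inflates the count. It is only a sketch: the record, document, and probability figures come from the messages above, but the even spread of rows across documents and all function names are assumptions made for illustration. If the sample filter runs on the raw rows before the GROUP, a document survives whenever at least one of its rows is kept, which is far more likely than keeping the document itself with probability 0.0012.

```python
# Hypothetical back-of-the-envelope check. The three constants are taken
# from the thread; the even spread of rows per document is an assumption.

TOTAL_ROWS = 1_428_280   # records in tfidf_all
NUM_DOCS = 18_863        # groups produced by GROUP tfidf_all BY doc_id
P = 0.0012               # probability passed to SAMPLE


def expected_groups_sampled_after_group(num_docs: int, p: float) -> float:
    """Each group is kept independently with probability p."""
    return num_docs * p


def expected_groups_sampled_before_group(doc_sizes, p: float) -> float:
    """Each row is kept with probability p; a group survives if any row does."""
    return sum(1.0 - (1.0 - p) ** n for n in doc_sizes)


# Assumption: every document has the same (average) number of rows.
avg_rows = TOTAL_ROWS / NUM_DOCS
even_sizes = [avg_rows] * NUM_DOCS

after = expected_groups_sampled_after_group(NUM_DOCS, P)
before = expected_groups_sampled_before_group(even_sizes, P)
expected_rows_kept = TOTAL_ROWS * P  # upper bound on surviving groups

print(f"sample after group:  about {after:.1f} groups")
print(f"sample before group: about {before:.0f} groups "
      f"(from about {expected_rows_kept:.0f} kept rows)")
```

Under the even-spread assumption the "before" estimate lands on the order of 1,600 groups, and it can never exceed the roughly 1,714 rows that survive the filter. Real documents vary in length, which pulls the estimate down (the survival probability is concave in the row count), so the observed 1,513 is in the range one would expect from a row-level filter, not a group-level sample.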