JIRA filed, see:
On Mon, 2011-04-25 at 09:02 -0700, Alan Gates wrote:
> You are not insane. Pig rewrites sample into filter, and then pushes
> that filter in front of the group. It shouldn't push that filter
> since the UDF is non-deterministic. If you add "-t PushUpFilter" to
> your command line when invoking pig this won't happen. Could you file
> a JIRA for this so we keep track of it?
> On Apr 24, 2011, at 10:41 AM, Jacob Perkins wrote:
> > So I'm running into something strange. Consider the following code:
> > tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
> > grouped = GROUP tfidf_all BY doc_id;
> > vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
> > DUMP vectors;
> > This, of course, runs just fine. tfidf_all contains 1,428,280 records.
> > The reduce output records should be exactly the number of documents,
> > which turns out to be 18,863 in this case. All well and good.
> > The strangeness comes when I add a SAMPLE command:
> > sampled = SAMPLE vectors 0.0012;
> > DUMP sampled;
> > Running this results in 1,513 reduce output records. So, am I insane, or
> > shouldn't the reduce output records be much, much closer to 22 or 23
> > records (e.g. 0.0012 * 18863)?
> > --jacob
> > @thedatachef
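
For anyone reading this thread later, the numbers above can be sanity-checked with a small Python sketch. It is not Pig code and does not reproduce Pig's optimizer; it only compares the two orderings described above: sampling the grouped rows versus sampling the raw records before the group. The record and document counts come from the thread. The function name `simulate_pushed_filter` and the assumption that records are spread uniformly across documents are mine, so the simulated group count should be read as a rough order of magnitude, not a prediction of the 1,513 figure.

```python
import random

TOTAL_RECORDS = 1_428_280   # records in tfidf_all (from the thread)
TOTAL_DOCS = 18_863         # distinct doc_id values (from the thread)
RATE = 0.0012               # SAMPLE probability (from the thread)

# Intended plan: group first, then sample the 18,863 grouped rows.
expected_after_group = RATE * TOTAL_DOCS

# Rewritten plan: the filter runs on the raw records before the GROUP.
expected_before_group = RATE * TOTAL_RECORDS


def simulate_pushed_filter(seed=0):
    """Sample raw records, then count the distinct doc_ids that survive.

    Assumption (hypothetical): every record is equally likely to belong
    to any document. Real TF-IDF data is almost certainly more skewed.
    """
    rng = random.Random(seed)
    surviving_docs = set()
    for _ in range(TOTAL_RECORDS):
        if rng.random() < RATE:
            # Only sampled records contribute a doc_id to the GROUP.
            surviving_docs.add(rng.randrange(TOTAL_DOCS))
    return len(surviving_docs)


if __name__ == "__main__":
    print(f"sample after group:  ~{expected_after_group:.1f} rows")
    print(f"sample before group: ~{expected_before_group:.0f} records")
    print(f"simulated groups:     {simulate_pushed_filter()}")
```

Sampling before the group keeps well over a thousand raw records, and most of them land in different documents, so the reducer emits a number of groups in the low thousands and not the 22 or 23 expected. That is consistent with the filter having been pushed in front of the GROUP.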