Pig >> mail # user >> SAMPLE after a GROUP BY


SAMPLE after a GROUP BY
So I'm running into something strange. Consider the following code:

tfidf_all = LOAD '$TFIDF' AS (doc_id:chararray, token:chararray, weight:double);
grouped = GROUP tfidf_all BY doc_id;
vectors = FOREACH grouped GENERATE group AS doc_id, tfidf_all.(token, weight) AS vector;
DUMP vectors;

This, of course, runs just fine. tfidf_all contains 1,428,280 records.
The number of reduce output records should equal the number of documents,
which turns out to be 18,863 in this case. All well and good.

The strangeness comes when I add a SAMPLE command:

sampled = SAMPLE vectors 0.0012;
DUMP sampled;

Running this results in 1,513 reduce output records. So, am I insane, or
shouldn't the number of reduce output records be much closer to 22 or 23
(i.e. 0.0012 * 18,863 = 22.6)?
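
As a quick sanity check on that arithmetic, here is a minimal sketch in Python. It assumes SAMPLE keeps each record independently with probability 0.0012 (a simple binomial model, which is an assumption and not something confirmed here about Pig's internals). The helper name expected_sample is hypothetical.

```python
import math

def expected_sample(n_records, p):
    """Mean and standard deviation of the sample size when each of
    n_records is kept independently with probability p (binomial model;
    this model is an assumption, not a statement about Pig internals)."""
    mean = n_records * p
    std = math.sqrt(n_records * p * (1 - p))
    return mean, std

P = 0.0012        # sampling fraction from the SAMPLE statement
N_DOCS = 18_863   # records in `vectors`, one per doc_id
OBSERVED = 1_513  # reduce output records actually reported

doc_mean, doc_std = expected_sample(N_DOCS, P)

# How many standard deviations the observed count sits from the expectation.
z_score = (OBSERVED - doc_mean) / doc_std

print(f"expected sample size: {doc_mean:.1f} +/- {doc_std:.1f}")
print(f"observed {OBSERVED} is {z_score:.0f} standard deviations away")
```

Under that assumption, ordinary sampling noise on 18,863 records cannot account for a count in the thousands, so the discrepancy looks like something other than random variation.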

--jacob
@thedatachef