Pig >> mail # user >> Order By Sampling


Re: Order By Sampling
For order-by, Pig samples 100 records from every map task, using a reservoir sampling algorithm.
I can't think of a way to store data that would adversely affect this sampling.
The class involved in the sampling is a Pig load function: org.apache.pig.impl.builtin.RandomSampleLoader
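
To illustrate the general technique, here is a minimal Python sketch of classic reservoir sampling (Algorithm R). It is not the code in RandomSampleLoader, and the function name, parameters, and seeded generator are hypothetical, chosen only for this example. The point it shows is that each record in a stream ends up in the sample with equal probability, whatever its position in the stream.

```python
import random
from typing import Iterable, List, Optional, TypeVar

T = TypeVar("T")


def reservoir_sample(
    records: Iterable[T],
    k: int = 100,
    rng: Optional[random.Random] = None,
) -> List[T]:
    """Return up to k records chosen uniformly at random from a stream.

    Hypothetical helper for illustration; k=100 mirrors the per-map-task
    sample size mentioned above.
    """
    rng = rng or random.Random()
    reservoir: List[T] = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Pick a slot in [0, i]; the record is kept only if the slot
            # falls inside the reservoir, i.e. with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


# Usage: sample 100 values from a stream of 10,000 records.
sample = reservoir_sample(range(10_000), k=100, rng=random.Random(42))
```

Because the chance of keeping a record depends only on how many records have been seen so far, a sampler of this kind needs a single pass and does not need to know the input size in advance.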

There is room for improvement in the logic that determines the total number of samples taken (currently an arbitrary 100 records per map task), but this seems to work well for most cases.

-Thejas
On 5/4/11 4:08 PM, "Brock Noland" <[EMAIL PROTECTED]> wrote:

Hello,

I am curious as to how Pig implements sampling for order by:

http://ofps.oreilly.com/titles/9781449302641/intro_pig_latin.html#order_by

Are there things I could do when storing my data which would adversely
affect this sampling?

Thanks,
Brock
