Pig, mail # user - Order By Sampling


Re: Order By Sampling
Thejas M Nair 2011-05-06, 19:33
For order-by, Pig samples 100 records from every map task, using reservoir sampling.
I can't think of a way of storing data that would adversely affect this sampling.
The class involved in the sampling is a Pig load function: org.apache.pig.impl.builtin.RandomSampleLoader
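
For readers who have not met reservoir sampling, here is a minimal Python sketch of the textbook version (often called Algorithm R). It is an illustration only and is not the code in RandomSampleLoader; the function name `reservoir_sample` and its parameters are hypothetical, and the default of 100 simply mirrors the per-map-task figure mentioned above.

```python
import random


def reservoir_sample(records, k=100, rng=None):
    """Pick up to k records uniformly at random from a stream of unknown length.

    Hypothetical helper for illustration; not Pig's implementation.
    """
    rng = rng or random.Random()
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            # Fill the reservoir with the first k records.
            reservoir.append(record)
        else:
            # Pick a slot in [0, i]; the new record replaces an old one
            # only if the slot falls inside the reservoir (probability k/(i+1)).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    sample = reservoir_sample(range(1_000_000), k=100)
    print(len(sample))
```

In this sketch every record in the stream ends up in the sample with the same probability, k/n for a stream of n records, whatever its position in the stream, and the stream is read only once without knowing n in advance.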

There is room for improvement in the logic that determines the total number of samples taken (currently an arbitrary 100 records per map task), but it seems to work well for most cases.

-Thejas
On 5/4/11 4:08 PM, "Brock Noland" <[EMAIL PROTECTED]> wrote:

Hello,

I am curious as to how Pig implements sampling for order by:

http://ofps.oreilly.com/titles/9781449302641/intro_pig_latin.html#order_by

Are there things I could do when storing my data which would adversely
affect this sampling?

Thanks,
Brock

--