Pig, mail # user - Limit vs Sample


Re: Limit vs Sample
Prasanth J 2013-02-28, 10:08
Sorry, I was confusing SAMPLE with RandomSampleLoader, which uses reservoir sampling.
SAMPLE is rewritten into a FILTER with a less-than expression that uses the sampling fraction as the predicate value, so it still reads every input record.
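
To make the difference concrete, here is a minimal Python sketch (not Pig's actual code) of the three behaviours discussed in this thread. The function names are hypothetical, and the assumption that the filter compares one uniform random draw per record against the fraction is my reading of the rewrite described above.

```python
import random
from itertools import islice


def sample_by_filter(records, fraction, rng=random):
    """Filter-style sampling: keep each record independently with
    probability `fraction`. Every record is read; the output size is
    only approximately fraction * len(input)."""
    for record in records:
        # Assumption: one uniform draw in [0, 1) per record, compared
        # against the sampling fraction with a less-than test.
        if rng.random() < fraction:
            yield record


def limit(records, n):
    """LIMIT-style behaviour: take the first n records and stop reading."""
    return islice(records, n)


def reservoir_sample(records, k, rng=random):
    """Reservoir sampling (Algorithm R): returns exactly k records
    (or fewer if the input is shorter), also reading every record."""
    reservoir = []
    for i, record in enumerate(records):
        if i < k:
            reservoir.append(record)
        else:
            # Replace a random slot with probability k / (i + 1).
            j = rng.randint(0, i)
            if j < k:
                reservoir[j] = record
    return reservoir


if __name__ == "__main__":
    data = range(1000)
    print(len(list(sample_by_filter(data, 0.1))))  # near 100, varies per run
    print(list(limit(data, 5)))                    # the first five records
    print(len(reservoir_sample(data, 100)))        # exactly 100
```

Both sampling variants scan the whole input; only the LIMIT-style function can stop early, which is why it gives no statistical guarantees.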

Thanks
-- Prasanth

On Feb 28, 2013, at 5:01 AM, Gianmarco De Francisci Morales <[EMAIL PROTECTED]> wrote:

> Hi,
> LIMIT takes the first X records, so there are no statistical guarantees.
> SAMPLE takes about X% of the records from the whole bag (uniformly at
> random), so you have statistical guarantees.
> No, SAMPLE does not use reservoir sampling.
>
> Cheers,
>
> --
> Gianmarco
>
>
> On Wed, Feb 27, 2013 at 12:23 AM, Prasanth J <[EMAIL PROTECTED]> wrote:
>
>> AFAIK, the SAMPLE operator internally uses reservoir sampling, so it reads
>> the entire data set to randomly generate the 10% sample.
>>
>> Thanks
>> -- Prasanth
>>
>> On Feb 26, 2013, at 6:19 PM, Panshul Whisper <[EMAIL PROTECTED]> wrote:
>>
>>> Hello,
>>>
>>> Can somebody please explain the difference between the LIMIT and SAMPLE
>>> statements?
>>> Does SAMPLE read the entire input file when the value is set to 0.1, or
>>> does it read randomly only until 10% of the data has been collected?
>>>
>>> Thank you for any help.
>>>
>>> --
>>> Regards,
>>> Ouch Whisper
>>> 010101010101
>>
>>