Pig >> mail # user >> Limit vs Sample


Hi,
LIMIT takes the first X records it reads, so the result carries no statistical
guarantees.
SAMPLE keeps roughly X% of the records, drawn uniformly from the whole bag, so
you do get statistical guarantees.
No, SAMPLE does not use reservoir sampling.
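
To make the difference concrete, here is a minimal Python sketch (not Pig code, and not Pig's actual implementation). It assumes SAMPLE behaves as a per-record probabilistic filter, which matches how the Pig documentation describes it: the whole input is scanned and the output size is only approximately X%. The function names `limit`, `bernoulli_sample` and `reservoir_sample` are hypothetical, chosen for illustration; reservoir sampling is included only for contrast.

```python
import random
from itertools import islice


def limit(records, n):
    """LIMIT-like: stop after the first n records; the rest is never read."""
    return list(islice(records, n))


def bernoulli_sample(records, p, rng):
    """SAMPLE-like (assumed model): scan every record and keep each one
    independently with probability p. Output size is about p * N, not exact."""
    return [r for r in records if rng.random() < p]


def reservoir_sample(records, k, rng):
    """Reservoir sampling (Algorithm R), for contrast: one full pass,
    exactly k records returned, each record equally likely to be kept."""
    reservoir = []
    for i, r in enumerate(records):
        if i < k:
            reservoir.append(r)
        else:
            j = rng.randint(0, i)  # inclusive on both ends
            if j < k:
                reservoir[j] = r
    return reservoir


rng = random.Random(42)  # fixed seed so the run is repeatable
data = range(1000)

first = limit(iter(data), 100)                    # always records 0..99
sampled = bernoulli_sample(iter(data), 0.1, rng)  # about 100 records, spread over 0..999
fixed = reservoir_sample(iter(data), 100, rng)    # exactly 100 records, spread over 0..999
```

Both sampling functions have to look at every record, so neither can stop early the way `limit` does. The difference between them is only whether the sample size is approximate or exact.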

Cheers,

--
Gianmarco
On Wed, Feb 27, 2013 at 12:23 AM, Prasanth J <[EMAIL PROTECTED]> wrote:

> AFAIK, the SAMPLE operator internally uses reservoir sampling, so it reads
> the entire data set to randomly generate the 10% sample.
>
> Thanks
> -- Prasanth
>
> On Feb 26, 2013, at 6:19 PM, Panshul Whisper <[EMAIL PROTECTED]>
> wrote:
>
> > Hello,
> >
> > Can somebody please explain to me the difference between the Limit and
> > Sample statements?
> > Does Sample read the entire input file when the value is set to 0.1, or
> > does it read randomly only until 10% of the data has been collected?
> >
> > Thanking You for any help.
> >
> > --
> > Regards,
> > Ouch Whisper
> > 010101010101
>
>